Map Reduce along with
Amazon EMR
Sampath Rachakonda & Siva Krishna Battu
Bigdata Analytics on Cloud Meetup
14th March 2015
http://www.meetup.com/abctalks
Agenda
• Introduction to BigData and Hadoop MapReduce
• Core Hadoop and its ecosystem
• Use cases
• Hadoop installation on Windows/Ubuntu
• Workflow of MapReduce
• MapReduce on EMR
• Real-time examples on EMR
• What's next? Hadoop 2.0!
Introduction to BigData
• BigData is the latest buzzword, but the data itself is also an opportunity
BigData – What & How
• Extremely large datasets that are hard to handle with relational databases
• Storage/cost
• Search/performance
• Analytics and visualization
• Need for parallel processing on hundreds of machines
• ETL cannot complete within a reasonable amount of time
• Once a run takes longer than 24 hours, processing never catches up
Solution to handle BigData
• Distributed file system
• The system shall manage and heal itself
• Automatically route around failures
• Speculatively execute redundant tasks based on performance
• Performance scales linearly
• Capacity changes in proportion to the resources added or removed
• Compute should move to the data
• Lower latency, lower bandwidth use
Introduction to Apache Hadoop
• What is Hadoop?
A scalable, fault-tolerant grid operating system for data storage and processing.
• Open source, Apache License
• Works with structured and unstructured data
• HDFS: fault-tolerant, high-bandwidth clustered storage
• Runs on commodity hardware
• Master (NameNode) – slave architecture
• MapReduce: distributed data processing
Hadoop Cluster
• A set of “cheap” commodity machines
• Networked together
• Located in one place, in a set of racks in a data centre
• No supercomputers; uses unreliable commodity hardware
Hadoop System Principles
• Scale out rather than scale up
• Bring code to data rather than data to code
• Deal with failures - they are common
• Abstract the complexity of distributed and concurrent applications
RDBMS vs Hadoop
• Before Hadoop, many applications used an RDBMS such as Oracle, MySQL or Sybase for batch processing
• Hadoop does not fully replace the RDBMS in the architecture
• RDBMS products scale up rather than scale out, and run into limits at hundreds of terabytes
• Structured vs unstructured data
• Offline batch processing vs online transactions
Hadoop + RDBMS Complement Each Other
• For example, a small website with a small number of users that generates a large amount of audit logs:
• WebServer (1) --> RDBMS --> (2) & (4) --> Hadoop (3)
• Use the RDBMS to back a rich user interface and to enforce data integrity
• The RDBMS generates lots of audit logs; the logs are moved periodically to the Hadoop cluster
• All logs are kept and processed in Hadoop for various analytics
• Results from the Hadoop cluster are stored back in the RDBMS to be used by the web server, e.g. suggestions based on audit history
Hadoop Ecosystem
• Hadoop mainly comprises two core components:
• HDFS (Hadoop Distributed File System) to store the data
• MapReduce (distributed data processing framework) to process the data
HDFS (Hadoop Distributed File System)
• A scalable, fault-tolerant, high-performance distributed file system
• Asynchronous replication
• Write once, read many (WORM)
• Hadoop cluster with a minimum of 3 DataNodes
• Data is divided into 64 MB (default) or 128 MB blocks; each block is replicated 3 times by default
• No RAID required for DataNodes
• Interfaces: Java, Thrift, C library, FUSE, WebDAV, HTTP, FTP
• NameNode holds the file system metadata
• Files are broken up and spread over the DataNodes
MapReduce (Distributed data processing framework)
• Software framework for distributed computation
• Input | map() | copy/sort | reduce() | Output
• JobTracker schedules and manages jobs
• TaskTracker executes individual map() and reduce() tasks on each cluster node
HDFS – Read File
HDFS – Write File
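The block and replication figures from the HDFS slide can be turned into a little arithmetic. The sketch below is only an illustration: the class and method names are hypothetical and it does not call any Hadoop API. It assumes the defaults quoted in the deck (64 MB or 128 MB blocks, replication factor 3) and that the last block of a file occupies only as many bytes as it actually holds.

```java
/** Hypothetical helper, not part of Hadoop: back-of-the-envelope HDFS sizing. */
final class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    /** Number of blocks a file is split into: ceiling of fileBytes / blockBytes. */
    static long blockCount(long fileBytes, long blockBytes) {
        if (fileBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    /** Total block replicas the DataNodes hold for the file. */
    static long replicaCount(long fileBytes, long blockBytes, int replication) {
        return blockCount(fileBytes, blockBytes) * replication;
    }

    /** Raw bytes consumed across the cluster (a partial last block is not padded). */
    static long rawBytesStored(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long oneGb = 1024 * MB;
        System.out.println("64 MB blocks:  " + blockCount(oneGb, 64 * MB));
        System.out.println("128 MB blocks: " + blockCount(oneGb, 128 * MB));
        System.out.println("replicas (64 MB, x3): " + replicaCount(oneGb, 64 * MB, 3));
        System.out.println("raw MB stored (x3):   " + rawBytesStored(oneGb, 3) / MB);
    }
}
```

For a 1 GB file this gives 16 blocks at 64 MB or 8 blocks at 128 MB, and three times the file size in raw storage at the default replication factor.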
MapReduce - Executing File
• The client program is copied to each node
• The JobTracker determines the number of splits from the input path, then selects TaskTrackers based on their network proximity to the data sources
• The JobTracker sends task requests to the selected TaskTrackers
• Each TaskTracker starts the map phase by extracting the input data from its splits
• When a map task completes, the TaskTracker notifies the JobTracker
• When all TaskTrackers have completed the map phase, the JobTracker notifies the TaskTrackers selected for the reduce phase
• Each of these TaskTrackers reads the region files remotely and invokes the reduce function, which collects each key and its aggregated value into the output file (one per reducer node)
• After both the map and reduce phases are complete, the JobTracker unblocks the client program
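The "one output file per reducer" step depends on every occurrence of a key reaching the same reducer. A common way to do this, and as far as we know the approach taken by Hadoop's default hash partitioner, is to take the key's hash modulo the number of reducers. The sketch below is plain Java with hypothetical names (PartitionSketch, partitionFor, assign); it does not use the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of hash partitioning of map output keys. */
final class PartitionSketch {

    /** Reducer index for a key; the mask keeps the hash non-negative. */
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Distributes keys into one bucket per reducer, preserving input order. */
    static List<List<String>> assign(List<String> keys, int numReducers) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(partitionFor(key, numReducers)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c", "a");
        List<List<String>> buckets = assign(keys, 3);
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("reducer " + i + " <- " + buckets.get(i));
        }
    }
}
```

Because the partition depends only on the key, both occurrences of "a" land in the same bucket, so a single reducer sees all values for that key.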
Java MapReduce Example
• Let us start with the basic word count example, which makes the workflow easy to understand
• Let us now dive into the word count demo and see how the mapper and reducer functions work, and more
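The demo code itself is not part of these slides. As a stand-in, here is a minimal in-memory sketch of word count in plain Java that mimics the map, copy/sort and reduce stages. It is not the Hadoop API: the class WordCountSketch and its methods are hypothetical, and everything runs in a single JVM rather than on a cluster.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical single-process imitation of the word count job. */
final class WordCountSketch {

    /** Map: emit a (word, 1) pair for every word in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    /** Copy/sort: group all values by key, with keys in sorted order. */
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce: sum the counts collected for one word. */
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    /** Runs map over every line, then shuffle, then reduce per key. */
    static SortedMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        SortedMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(mapped).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Hadoop stores data",
                "Hadoop processes data in parallel");
        run(lines).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

In a real Hadoop job the map and reduce methods would live in Mapper and Reducer subclasses, and the framework would perform the grouping step between them across the cluster.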
Introduction to Amazon AWS & EMR
AWS is a cloud infrastructure that provides
• Elastic capacity
• Quick and easy deployment
• No CapEx, no initial investment
• Pay as you go, for what you use
• Automation and reusable components
Amazon EMR: Hadoop in the Cloud
• Scalable and fault tolerant
• Flexibility for multiple languages and data formats
• Open source
• Ecosystem of tools
• Batch and real-time analytics
• Amazon EMR is the easiest way to run Hadoop in the cloud
• Now let us run on EMR the same example we ran on the single-node cluster, and look at how feasible it is
Thank You !!
https://www.facebook.com/abctalks

Map Reduce along with Amazon EMR

  • 1. Map Reduce along with Amazon EMR Sampath Rachakonda & Siva Krishna Battu Bigdata Analytics on Cloud Meetup 14th March 2015 http://www.meetup.com/abctalks
  • 2. Agenda http://www.meetup.com/abctalks  Introduction to BigData and Hadoop MapReduce  Core Hadoop and its Ecosystem  Use Cases  Hadoop Installation on Windows/Ubuntu  Work flow on Map Reduce  M-R on EMR  Real-time examples on EMR  What's Next ? Hadoop 2.0!
  • 3. Introduction to BigData http://www.meetup.com/abctalks  It is the latest buzz but on the other hand data is an opportunity
  • 4. BigData – What & How http://www.meetup.com/abctalks
  • 5. Contd.. http://www.meetup.com/abctalks  Extremely large datasets that are hard to deal with using relational databases  Storage/Cost  Search/Performance  Analytics and Visualization  Need for parallel processing on hundreds of machines  ETL cannot complete within a reasonable amount of time  Beyond 24 hours never catch up
  • 7. Solution to handle BigData http://www.meetup.com/abctalks  Distributed File System  System shall manage and heal itself  Automatically route around failure  Speculatively execute redundant tasks based on performance  Performance Scale Linearly  Proportional change in capacity with resource change  Compute should move to data  Lower Latency, Lower Bandwidth
  • 8. Introduction to Apache Hadoop
      • What is Hadoop? A scalable, fault-tolerant grid operating system for data storage and processing.
      • Open source, Apache License
      • Works with structured and unstructured data
      • HDFS: fault-tolerant, high-bandwidth clustered storage
      • Commodity hardware
      • Master (NameNode) – slave architecture
      • MapReduce: distributed data processing
  • 9. Hadoop Cluster
      • A set of “cheap” commodity hardware
      • Networked together
      • Resides in one location, in a set of racks in a data centre
      • No supercomputers; uses unreliable commodity hardware
  • 10. Hadoop System Principles
      • Scale out rather than scale up
      • Bring code to data rather than data to code
      • Deal with failures: they are common
      • Abstract the complexity of distributed and concurrent applications
  • 11. RDBMS vs Hadoop
      • Before Hadoop, many applications used an RDBMS such as Oracle, MySQL, or Sybase for batch processing
      • Hadoop doesn't fully replace the RDBMS architecture
      • RDBMS products scale up rather than scale out, with limits in the 100s of terabytes
      • Structured vs unstructured data
      • Offline batch vs online transactions
  • 12. Hadoop + RDBMS Complement Each Other
      • For example, a small website with a small number of users generating a large amount of audit logs:
      • WebServer (1) --> RDBMS --> (2) & (4) --> Hadoop (3)
      • Use the RDBMS for a rich user interface and to enforce data integrity
      • The RDBMS generates lots of audit logs; the logs are moved periodically to the Hadoop cluster
      • All logs are kept and processed in Hadoop for various analytics
      • Results from the Hadoop cluster are stored back in the RDBMS to be used by the web server. Ex: suggestions based on audit history
  • 13. Hadoop Ecosystem
      • Hadoop mainly comprises two core components:
      • HDFS (Hadoop Distributed File System) to store data
      • MapReduce (distributed data processing framework) to process data
  • 14. HDFS (Hadoop Distributed File System)
      • A scalable, fault-tolerant, high-performance distributed file system
      • Asynchronous replication
      • Write-Once Read-Many (WORM)
      • Hadoop cluster with a minimum of 3 DataNodes
      • Data is divided into blocks of 64 MB (default) or 128 MB; each block is replicated 3 times by default
      • No RAID required for DataNodes
      • Interfaces: Java, Thrift, C library, FUSE, WebDAV, HTTP, FTP
      • The NameNode holds the file system metadata
      • Files are broken up and spread over the DataNodes
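
The block-size and replication figures on this slide translate directly into storage arithmetic. The following is a minimal sketch in plain Python (it does not call any Hadoop API); the function name `hdfs_footprint` and the 1000 MB example file are illustrative assumptions, while the 64 MB block size and replication factor of 3 are the defaults quoted above.

```python
import math

MB = 1024 * 1024


def hdfs_footprint(file_size_bytes, block_size_bytes=64 * MB, replication=3):
    """Return (blocks, block replicas, raw bytes stored) for one file.

    Hypothetical helper for illustration only; not part of Hadoop.
    """
    if file_size_bytes == 0:
        return 0, 0, 0
    # A file is cut into fixed-size blocks; the last block may be partial.
    num_blocks = math.ceil(file_size_bytes / block_size_bytes)
    # A partial last block only occupies the bytes it actually holds,
    # so raw usage is the file size times the replication factor.
    raw_bytes = file_size_bytes * replication
    return num_blocks, num_blocks * replication, raw_bytes


blocks, replicas, raw = hdfs_footprint(1000 * MB)
print(blocks, replicas, raw // MB)  # 16 48 3000
```

So a 1000 MB file occupies 16 blocks, stored as 48 block replicas spread over the DataNodes, and consumes about 3000 MB of raw disk across the cluster.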
  • 15. MapReduce (Distributed Data Processing Framework)
      • Software framework for distributed computation
      • Input | map() | copy/sort | reduce() | Output
      • The JobTracker schedules and manages jobs
      • A TaskTracker executes individual map() and reduce() tasks on each cluster node
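
The "Input | map() | copy/sort | reduce() | Output" pipeline can be imitated on a single machine to see what each stage contributes. This is a rough sketch in plain Python, not the Hadoop API: `run_mapreduce`, `map_user`, `reduce_count`, and the sample audit-log lines are all made up for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Single-process imitation of Input | map | copy/sort | reduce | Output."""
    # Map phase: every input record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))

    # Copy/sort phase: bring all values for the same key together.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one reducer call per key, visited in sorted key order.
    return [reducer(key, groups[key]) for key in sorted(groups)]


def map_user(line):
    # Hypothetical log format: "<user> <action>".
    user, _action = line.split()
    return [(user, 1)]


def reduce_count(user, counts):
    return (user, sum(counts))


logs = ["alice login", "bob login", "alice view", "alice logout"]
result = run_mapreduce(logs, map_user, reduce_count)
print(result)  # [('alice', 3), ('bob', 1)]
```

On a real cluster the map and reduce calls run as separate tasks on different nodes, and the copy/sort stage moves data across the network; the data flow is otherwise the same.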
  • 16. HDFS – Read File
  • 17. HDFS - Write File
  • 18. MapReduce - Executing a Job
      • The client program is copied to each node.
      • The JobTracker determines the number of splits from the input path, then selects TaskTrackers based on their network proximity to the data sources.
      • The JobTracker sends task requests to the selected TaskTrackers.
      • Each TaskTracker starts the map phase by extracting the input data from its splits.
      • When a map task completes, the TaskTracker notifies the JobTracker.
      • When all TaskTrackers have completed the map phase, the JobTracker notifies the TaskTrackers selected for the reduce phase.
      • Each of these TaskTrackers reads the region files remotely and invokes the reduce function, which collects each key and its aggregated value into the output file (one per reducer node).
      • After both the map and reduce phases are complete, the JobTracker unblocks the client program.
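
The "network proximity" step can be pictured as a preference order: a node that holds a replica of the split, then a node in the same rack as a replica, then any other node. The sketch below is a simplified plain-Python illustration of that idea, not the actual JobTracker scheduler (which hands out tasks in response to TaskTracker heartbeats); the function `assign_splits`, the node and rack names, and the slot counts are all hypothetical.

```python
def assign_splits(splits, rack_of, free_slots):
    """Assign each split to the closest node that still has a free map slot.

    splits:     split id -> list of nodes holding a replica of that split
    rack_of:    node -> rack name
    free_slots: node -> number of free map slots
    Returns split id -> (chosen node, locality label).
    """
    labels = ("node-local", "rack-local", "off-rack")
    slots = dict(free_slots)  # work on a copy; do not mutate the caller's dict
    assignment = {}
    for split_id, replica_nodes in splits.items():
        replica_racks = {rack_of[node] for node in replica_nodes}

        def distance(node):
            if node in replica_nodes:
                return 0  # data is on this node
            if rack_of[node] in replica_racks:
                return 1  # data is in the same rack
            return 2      # data must cross racks

        candidates = [node for node, free in slots.items() if free > 0]
        if not candidates:
            break  # no capacity left; remaining splits wait for a later round
        # Ties on distance are broken by node name to keep the result stable.
        best = min(candidates, key=lambda node: (distance(node), node))
        slots[best] -= 1
        assignment[split_id] = (best, labels[distance(best)])
    return assignment


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2", "n5": "r3"}
free_slots = {"n1": 1, "n2": 1, "n3": 1, "n4": 0, "n5": 1}
splits = {"s0": ["n1", "n3"], "s1": ["n1", "n4"], "s2": ["n4"], "s3": ["n1"]}
assignment = assign_splits(splits, rack_of, free_slots)
for split_id, (node, locality) in assignment.items():
    print(split_id, node, locality)
```

In this made-up cluster, `s0` runs node-local on `n1`, `s1` and `s2` fall back to rack-local nodes, and `s3` ends up off-rack on `n5` because every closer slot is already taken.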
  • 19. Java MapReduce Example
      • Let us start with the basic word count example, which makes the workflow easy to understand
      • Let us now dive into the word count demo and see how the mapper and reducer functions work, and more
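
The talk's demo code is in Java and is not reproduced in this transcript. As a stand-in, here is a minimal word count sketch in Python, written in the style of Hadoop Streaming, where the mapper and reducer exchange tab-separated "key<TAB>value" text lines. The function names and the sample sentences are assumptions for illustration; the `sorted()` call stands in for the framework's copy/sort step.

```python
from itertools import groupby


def mapper(lines):
    """Emit "word<TAB>1" for every word in the input lines."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; input must already be sorted by word."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _word, count in group)
        yield f"{word}\t{total}"


text = ["the quick brown fox", "the lazy dog", "The fox"]
mapped = sorted(mapper(text))  # stands in for the copy/sort phase
counts = list(reducer(mapped))
for line in counts:
    print(line)
```

The reducer relies on its input being sorted by key, which is what lets it process one word at a time without holding all counts in memory.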
  • 20. Introduction to Amazon AWS & EMR
      • AWS is a cloud infrastructure that provides:
      • Elastic capacity
      • Quick and easy deployment
      • No CapEx, no initial investment
      • Pay as you go, for what you use
      • Automation and reusable components
  • 21. Amazon EMR: Hadoop in the Cloud
      • Scalable and fault tolerant
      • Flexible: supports multiple languages and data formats
      • Open source
      • Ecosystem of tools
      • Batch and real-time analytics
      • Amazon EMR is the easiest way to run Hadoop in the cloud
      • Now let us run the same example we ran on the single-node cluster on EMR and see how feasible it is
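
One way to picture "the same example on EMR" is as a step definition submitted to a cluster. The sketch below only builds the step as a Python dictionary and does not contact AWS. It assumes a recent EMR release where `command-runner.jar` launches `hadoop-streaming` (clusters from the time of this talk used a different jar path), and the helper `streaming_step`, the bucket `example-bucket`, and the script names are all hypothetical.

```python
def streaming_step(name, mapper_uri, reducer_uri, input_uri, output_uri):
    """Build an EMR step description for a Hadoop Streaming job.

    Hypothetical helper; the dictionary layout follows the EMR step
    structure (Name, ActionOnFailure, HadoopJarStep).
    """
    # Scripts are shipped with -files and then referenced by file name.
    mapper_file = mapper_uri.rsplit("/", 1)[-1]
    reducer_file = reducer_uri.rsplit("/", 1)[-1]
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # assumption: EMR release 4.x or later
            "Args": [
                "hadoop-streaming",
                "-files", f"{mapper_uri},{reducer_uri}",
                "-mapper", mapper_file,
                "-reducer", reducer_file,
                "-input", input_uri,
                "-output", output_uri,
            ],
        },
    }


step = streaming_step(
    name="word-count",
    mapper_uri="s3://example-bucket/code/mapper.py",    # placeholder location
    reducer_uri="s3://example-bucket/code/reducer.py",  # placeholder location
    input_uri="s3://example-bucket/input/",
    output_uri="s3://example-bucket/output/",
)
print(step["HadoopJarStep"]["Args"])
```

A dictionary of this shape is the kind of entry that goes in the `Steps` list of the boto3 EMR client's `run_job_flow` or `add_job_flow_steps` calls; input and output live in S3 rather than in the cluster's HDFS, so the cluster can be shut down once the step finishes.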

Editor's Notes

  1. http://hortonworks.com/blog/7-key-drivers-for-the-big-data-market/