Map Reduce along with Amazon EMR

Map Reduce along with
Amazon EMR
Sampath Rachakonda & Siva Krishna Battu
Bigdata Analytics on Cloud Meetup
14th March 2015
http://www.meetup.com/abctalks

Agenda
 Introduction to BigData and Hadoop
MapReduce
 Core Hadoop and its Ecosystem
 Use Cases
 Hadoop Installation on Windows/Ubuntu
 Work flow on Map Reduce
 M-R on EMR
 Real-time examples on EMR
 What's Next ? Hadoop 2.0!

Introduction to BigData
 It is the latest buzz but on the other hand data is an opportunity

BigData – What & How

Contd..
 Extremely large datasets that are hard to deal with using
relational databases
 Storage/Cost
 Search/Performance
 Analytics and Visualization
 Need for parallel processing on hundreds of machines
 ETL cannot complete within a reasonable amount of time
 Beyond 24 hours never catch up

Solution to handle BigData
 Distributed File System
 System shall manage and heal itself
 Automatically route around failure
 Speculatively execute redundant tasks based on
performance
 Performance Scale Linearly
 Proportional change in capacity with resource change
 Compute should move to data
 Lower Latency, Lower Bandwidth

Introduction Apache Hadoop
 What is Hadoop ?
A scalable fault tolerant grid operating system for data storage and
processing.
 Open Source, Apache License
 Works with Structured and Unstructured Data
 HDFS: Fault-Tolerant high-bandwidth clustered storage
 Commodity Hardware
 Master (name-node) – Slave Architecture
 MapReduce : Distributed Data Processing

Hadoop Cluster
 A set of “cheap” commodity hardware
 Networked together
 Resides in same location in set of
racks in a data centre
 No super computers, use commodity
unreliable hardware

Hadoop System Principles
 Scale-Out rather than Scale-Up
 Bring Code to data rather than Data to Code
 Deal with failures - they are common
 Abstract complexity of distributed and concurrent applications

RDBMS Vs Hadoop
 Before hadoop many applications
used RDBMS for batch processing
like Oracle, MySQL, Sybase, etc..
 Hadoop doesn't fully replace RDBMS
the architecture
 RDBMS products Scale-up rather
than Scale-Out with limitations of
100s of terabytes
 Structured Vs Unstructured
 Offline Batch Vs Online Transactions

Hadoop + RDBMS Complements each other
 For example a small website with small number of users
generating large amount of audit logs :
 WebServer (1)--> RDBMS --> (2)&(4) --> Hadoop(3)
 Use RDBMS for rich user interface and enforce data
integrity
 RDBMS generates lots of audit logs; the logs are moved
periodically to hadoop cluster
 All logs are kept & processed in Hadoop for various
analytics
 Results from hadoop cluster are stored back onto RDBMS
to be used by web server. Ex: Suggestions based on audit
history

Hadoop Eco System
 Hadoop mainly comprised of two
core components :
 HDFS(Hadoop Distributed File
System) to store data & process
data
 MapReduce(Distributed data
processing framework)

HDFS(Hadoop Distributed File System)
 A scalable, fault-tolerant, High Performance distributed file
system
 Asynchronous Replication
 Write-Once Read Many (WORM)
 Hadoop cluster with 3 data nodes minimum
 Data divided into 64 MB(default) or 128 MB blocks, each
block replicated 3 times by default
 No RAID required for DataNode
 Interfaces: Java, Thrift, C Library, FUSE, WebDAV, HTTP, FTP
 NameNode holds the file system metadata
 Files are broken up and spread over DataNodes

MapReduce(Distributed data processing framework)
 Software Framework for distributed Computation
 Input | map () | CopySort | Reduce {} | Output
 Jobtracker schedules and manages jobs
 Tasktracker executes individual map() and reduce() tasks on
each cluster node

HDFS – Read File

HDFS - Write File

MapReduce - Executing File
 Client program is copied on each node
 JobTracker determines number of splits from input path & then select
some task trackers based on their network proximity to the data sources
 Now JobTracker sends task request to the selected TaskTrackers
 Each TaskTracker starts the map phase processing by extracting the
input data from the splits
 Once Map task completes, TaskTracker notifies the JobTracker.
 When all TaskTrackers complete mapper phase, TaskTracker will notify
the selected TaskTrackers for reducer phase.
 Each TaskTracker reads region files remotely & invokes the reverse
function, which collects the key/aggregated value into the output file
(one per reducer node).
 After both mapper & reducer phases are completed, the JobTracker
unblocks the client program.

Java MapReduce Example
 Let us go with the basic word count example which helps us
to understand the workflow easily
 Let us now dive into the demo of word count and understand
how does mapper, reducer functions and more..

Introduction to Amazon AWS & EMR
AWS is an cloud infrastructure which
provides
 Elastic Capacity
 Quick and Easy Deployment
 No CapEx, No initial investment
 Pay as you go, for what you use
 Automation & Reusable components

Amazon EMR : Hadoop in Cloud
 Scalable and fault tolerant
 Flexibility for multiple languages
and data formats
 Open Source
 Ecosystem of tools
 Batch and real-time analytics
 Amazon EMR is the easiest way to
run hadoop in the cloud
 Now let us look at the same example
we did on single node cluster on EMR
and look at the feasibility of doing it

Thank You !!
https://www.facebook.com/abctalks

Map Reduce along with Amazon EMR

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Map Reduce along with Amazon EMR

Similar to Map Reduce along with Amazon EMR (20)

Recently uploaded

Recently uploaded (20)

Map Reduce along with Amazon EMR

Editor's Notes