Big Data and Hadoop in Cloud

               Vijay Rayapati
               @amnigos
Follow Barcamp Rules!
What is Big Data?



Datasets that grow so large that they become
awkward to work with using on-hand database
management tools. Difficulties include
capture, storage, search, sharing, analytics,
and visualizing - Wikipedia


High volume of data (storage) + speed of data
(scale) + variety of data (different types) - Gartner
World is ON = Content + Interactions = More Data
            (Social and Mobile)
Tons of data is generated by each one of us!

 (We moved from GB to ZB and from Millions to Zillions)
Big Data - Intelligence
Big Data - Usefulness
Big Data - There is so much more you can do!
Everybody has this problem – Not just Amazon, Google,
                Facebook and Twitter!
How can we work with
     Big Data?
Why Cloud and Big Data?

Cloud has democratized access to large-scale
infrastructure for the masses!




You can store, process and manage big
data sets without worrying about IT!

                    **http://wiki.apache.org/hadoop/PoweredBy
Hadoop – The data elephant
Hadoop makes it easier to store, process and analyze
a lot of data on commodity hardware!
Who uses Hadoop and How?




Everybody (from A to Z )
         to
Solve complex problems




               **http://wiki.apache.org/hadoop/PoweredBy
Big Data and Hadoop - It’s Fun
Master Node: Job Tracker + Name Node

Map Reduce (processing): Job Tracker -> Task Tracker, Task Tracker, Task Tracker
HDFS Layer (storage):    Name Node   -> Data Node, Data Node, Data Node
Map Reduce Paradigm
Map Reduce - Explained
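
The paradigm named on these two slides can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, shuffle and reduce steps using the classic word count, not Hadoop code; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in order over an iterable of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data in cloud", "big data and hadoop"]))
```

In a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data over the network; here everything runs in one process purely to show the data flow.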
Hadoop – Getting Started


• Download latest stable version - http://hadoop.apache.org/common/releases.html

• Install Java ( > 1.6.0_20 ) and set your JAVA_HOME

• Install rsync and ssh

• Follow instructions - http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html

• Hadoop Modes – Local, Pseudo-distributed and Fully distributed

• Run in pseudo-distributed mode for your testing and development

• Assign a decent JVM heap size through mapred.child.java.opts if you
  notice task errors, GC overhead or OOM errors

• Play with samples – WordCount, TeraSort etc.

• Good for learning - http://www.cloudera.com/hadoop-training-virtual-machine
Why Amazon EMR?



I am interested in using Hadoop
to solve problems and not in
building and managing Hadoop
Infrastructure!
Amazon EMR – Setup


• Install Ruby 1.8.X and use EMR Ruby CLI for managing EMR.

• Just create a credentials.json file in your EMR Ruby CLI installation
  directory and provide your access key and private key.

• Bootstrapping is a great way to install required components or
  perform custom actions in your EMR cluster.

• Default bootstrap action is available to control the configuration of
  Hadoop and MapReduce.

• Bootstrap with Ganglia during your development and tuning phase –
  provides monitoring metrics across your cluster.

• The EMR Ruby CLI has minor bugs but works well for most needs.
Amazon EMR – Setup


• Launching a 500 node and fully configured cluster is as simple
  as firing one command

   > elastic-mapreduce --create --alive --plain-output \
       --master-instance-type m1.xlarge --slave-instance-type m2.2xlarge \
       --num-instances 500 --name "Site Analytics Cluster" \
       --bootstrap-action s3://com.bcb11.emr/scripts/bootstrap-custom.sh \
       --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia \
       --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
       --args "--mapred-config-file,s3://com.bcb11.emr/conf/custom-mapred-site.xml"

   > elastic-mapreduce -j ${jobflow} --stream --step-name "Profile Analyzer" \
       --jobconf mapred.task.timeout=0 \
       --mapper s3://com.bcb11.emr/code/mapper.rb \
       --reducer s3://com.bcb11.emr/bin/reducer.rb \
       --cache s3://com.bcb11.emr/cache/customdata.dat#data.txt \
       --input s3://com.bcb11.emr/input/ --output s3://com.bcb11.emr/output
Amazon EMR - Service Architecture
EMR CLI – What you need to know?


• elastic-mapreduce -j <jobflow id> --describe

• elastic-mapreduce --list --active

• elastic-mapreduce -j <jobflow id> --terminate

• elastic-mapreduce --jobflow <jobflow id> --ssh

• Look in your logs directory in S3 if you need any other
  information on cluster setup, Hadoop logs, job step logs, task
  attempt logs etc.
EMR Map Reduce Jobs



• Amazon EMR supports streaming, custom jar, Cascading, Pig
  and Hive, so you can write jobs the way you want without worrying
  about managing the underlying infrastructure, including Hadoop.

• Streaming – Write Map Reduce jobs in any scripting language.

• Custom Jar – Write using Java and good for speed/control.

• Cascading, Hive and Pig – Higher level of abstraction.

• Use a good S3 explorer, FoxyProxy and ElasticFox.

• Use the AWS EMR forum if you need help.
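
To make the streaming option concrete: a streaming job is just a program that reads lines on stdin and writes tab-separated key/value lines on stdout, with the reducer receiving its input sorted by key. Below is a minimal word-count sketch in Python (the deck's own jobs were written in Ruby). The function names and the "map"/"reduce" command-line argument are hypothetical conventions chosen for this example.

```python
import sys
from itertools import groupby


def run_mapper(lines):
    """Yield one 'word<TAB>1' record per word, the usual streaming key/value layout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def run_reducer(sorted_records):
    """Sum the counts per key; the input must already be sorted by key."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical convention: pass "map" or "reduce" as the first argument.
    stage = run_mapper if sys.argv[1] == "map" else run_reducer
    for out in stage(sys.stdin):
        print(out)
```

You can try it locally without Hadoop by piping a file through the mapper, then sort, then the reducer, which mimics what the framework does between the two stages.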
EMR – Debugging and Performance Tuning
Hadoop – Debugging and Profiling


• Run hadoop in local mode for debugging so mapper and reducer
  tasks run in a single JVM instead of separate JVMs.

• Configure HADOOP_OPTS to enable debugging.
 (export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8008")

• Change the fs.default.name value in core-site.xml from hdfs:// to file:///

• Configure mapred.job.tracker value in mapred-site.xml to local

• Create debug configuration for Eclipse and set the port to 8008.

• Run your hadoop job and launch Eclipse with your Java code so you
  can start debugging.

• Use your favorite profiler to understand code level hotspots.
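
The two local-mode settings above live in Hadoop's XML configuration files. As a rough sketch, the snippet below renders them in the standard configuration/property/name/value layout using Python's standard library. The helper name hadoop_config_xml and the two dictionaries are illustrative only; in practice you would edit core-site.xml and mapred-site.xml by hand.

```python
import xml.etree.ElementTree as ET


def hadoop_config_xml(properties):
    """Render a dict of settings in Hadoop's <configuration>/<property> layout."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Local-mode values taken from the bullets above.
CORE_SITE = {"fs.default.name": "file:///"}
MAPRED_SITE = {"mapred.job.tracker": "local"}

if __name__ == "__main__":
    print(hadoop_config_xml(CORE_SITE))    # goes into core-site.xml
    print(hadoop_config_xml(MAPRED_SITE))  # goes into mapred-site.xml
```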
EMR – Good, Bad and Ugly


• Great for bootstrapping large clusters and very cost-effective if
  you only occasionally need infrastructure to run your Hadoop jobs.

• Don’t need to worry about underlying Hadoop cluster setup and
  management. Most patches are applied and Amazon creates new
  AMIs with improvements.

• Doesn’t have a fallback (secondary name node) – only one
  master node.

• Intermittent network issues – sometimes these can cause serious
  degradation of performance.

• Network IO is variable and streaming jobs will be much more sluggish
  on EMR compared to a dedicated setup.

• Disk IO is terrible across instance families and types – Please fix
  it.
Hadoop – High Level Tuning




• Small files problem – avoid too many small files and tune your
  block size.

• Tune your settings – JVM Reuse, Sort Buffer, Sort Factor,
  Map/Reduce Tasks, Parallel Copies, MapRed Output Compression etc.

• Know what is limiting you at a node level – CPU, Memory,
  Disk IO or Network IN/OUT.

• The good thing is that you can use a small cluster and a sample
  input size for tuning.
Hadoop – What affects your job performance?


• GC overhead – increase task memory and reduce the number of tasks per reused JVM.

• Increase dfs block size (default 128MB in EMR) for large files.

• Avoid read contention at S3 – have equal or more files in S3
  compared to available mappers.

• Use mapred output compression to save storage, processing
  time and bandwidth costs.

• Set the mapred task timeout to 0 if you have long-running jobs (> 10
  mins); you can also disable speculative execution.

• Increase sort buffer and sort factor based on map tasks output.
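
A back-of-the-envelope calculation shows why the small files and block size bullets matter. The sketch below assumes roughly one map task per block-sized split and at least one per non-empty file; this is a simplification of how Hadoop really computes splits, and the function name estimated_map_tasks is invented for the example.

```python
import math

MB = 1024 * 1024


def estimated_map_tasks(file_sizes_bytes, block_size_bytes=128 * MB):
    """Approximate map task count: one task per block-sized split of each non-empty file."""
    return sum(
        math.ceil(size / block_size_bytes)
        for size in file_sizes_bytes
        if size > 0
    )


if __name__ == "__main__":
    many_small = [1 * MB] * 10_000   # 10,000 files of 1 MB each
    one_large = [10_000 * MB]        # the same volume in a single file
    print(estimated_map_tasks(many_small))           # one task per tiny file
    print(estimated_map_tasks(one_large))            # far fewer tasks at 128 MB blocks
    print(estimated_map_tasks(one_large, 256 * MB))  # fewer still with a larger block size
```

The same amount of data produces thousands of short-lived tasks when it is split into tiny files, and only a few dozen when it is stored as one large file, which is where the per-task startup overhead is saved.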
Understand – EMR Cluster Metrics
Common Bottlenecks – Monitor Matters
Hadoop and EMR – What I have learned?


• Code is god – If you have severe performance issues then look at
  your code 100 times, understand third party libraries used and
  rewrite in Java if required.

• Streaming jobs are slow compared to Custom Jar jobs because of the
  overhead; scripting is good for ad hoc analysis.

• Disk IO and Network IO affect your processing time.

• Be ready to face variable performance in Cloud.

• Monitor everything once in a while and keep benchmarking with
  data points.

• Default settings are seldom optimal in EMR – unless you run
  simple jobs.

• Focus on optimization as it’s the only way to save Cost and Time.
Hadoop and EMR – Performance Tuning Example


• Streaming: MapReduce jobs were written in Ruby. The input
  dataset was 150 GB and the output was around 4000 GB. Complex
  processing, heavily bound by CPU and disk IO.

• Time taken to complete job processing: 180 minutes on 4000
  m1.xlarge nodes.

• Rewrote the code in Java – job processing time was reduced to
  70 minutes on just 400 m1.xlarge nodes.

• Tuning the EMR configuration further reduced it to 32 minutes.

• Focus on code first and then focus on configuration.
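
To put the numbers above on one scale, it helps to compare node-minutes (nodes multiplied by wall-clock minutes). The sketch below uses the figures exactly as given on the slide. One assumption is flagged in the code: the slide does not say how many nodes the tuned run used, so it is taken to be the same 400 nodes as the Java rewrite.

```python
def node_minutes(nodes, minutes):
    """Total cluster time consumed: node count multiplied by wall-clock minutes."""
    return nodes * minutes


# Figures as stated on the slide. The node count for the tuned run is an
# assumption (the slide gives only the 32 minutes).
RUNS = [
    ("Ruby streaming", 4000, 180),
    ("Java rewrite", 400, 70),
    ("Java rewrite + tuned EMR config", 400, 32),
]


def summarize(runs):
    """Return (label, node-minutes, reduction factor versus the first run) for each run."""
    baseline = node_minutes(runs[0][1], runs[0][2])
    return [
        (label, node_minutes(nodes, minutes), baseline / node_minutes(nodes, minutes))
        for label, nodes, minutes in runs
    ]


if __name__ == "__main__":
    for label, total, factor in summarize(RUNS):
        print(f"{label}: {total} node-minutes ({factor:.2f}x fewer than the first run)")
```

Since EMR bills by instance time, node-minutes is a rough proxy for cost as well as for effort saved.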
Q&A
Like what we do? – connect with me
        Kuliza.com | vijay.rayapati@kuliza.com | @kuliza




                           vijay.rayapati@kuliza.com
                           @amnigos

More Related Content

What's hot

Tune hadoop
Tune hadoopTune hadoop
Tune hadoop
Jason Shao
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_Plan
Narayana B
 
Introduction to hadoop administration jk
Introduction to hadoop administration   jkIntroduction to hadoop administration   jk
Introduction to hadoop administration jkEdureka!
 
Deployment and Management of Hadoop Clusters
Deployment and Management of Hadoop ClustersDeployment and Management of Hadoop Clusters
Deployment and Management of Hadoop Clusters
Amal G Jose
 
Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization
Shivkumar Babshetty
 
Webinar: Top 5 Hadoop Admin Tasks
Webinar: Top 5 Hadoop Admin TasksWebinar: Top 5 Hadoop Admin Tasks
Webinar: Top 5 Hadoop Admin Tasks
Edureka!
 
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationHadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationYahoo Developer Network
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationDataWorks Summit
 
Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014Ryu Kobayashi
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
outstanding59
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephants
Ovidiu Dimulescu
 
Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0
Manaranjan Pradhan
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
Ted Dunning
 
HBase Sizing Notes
HBase Sizing NotesHBase Sizing Notes
HBase Sizing Noteslarsgeorge
 
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, ClouderaHadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Cloudera, Inc.
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101
EMC
 
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
Big Data Montreal
 
Hadoop
HadoopHadoop
Hadoop
Cassell Hsu
 
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Cloudera, Inc.
 
White paper hadoop performancetuning
White paper hadoop performancetuningWhite paper hadoop performancetuning
White paper hadoop performancetuning
Anil Reddy
 

What's hot (20)

Tune hadoop
Tune hadoopTune hadoop
Tune hadoop
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_Plan
 
Introduction to hadoop administration jk
Introduction to hadoop administration   jkIntroduction to hadoop administration   jk
Introduction to hadoop administration jk
 
Deployment and Management of Hadoop Clusters
Deployment and Management of Hadoop ClustersDeployment and Management of Hadoop Clusters
Deployment and Management of Hadoop Clusters
 
Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization
 
Webinar: Top 5 Hadoop Admin Tasks
Webinar: Top 5 Hadoop Admin TasksWebinar: Top 5 Hadoop Admin Tasks
Webinar: Top 5 Hadoop Admin Tasks
 
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationHadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux Configuration
 
Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephants
 
Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 
HBase Sizing Notes
HBase Sizing NotesHBase Sizing Notes
HBase Sizing Notes
 
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, ClouderaHadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101
 
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
 
White paper hadoop performancetuning
White paper hadoop performancetuningWhite paper hadoop performancetuning
White paper hadoop performancetuning
 

Viewers also liked

Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?
Cloudera, Inc.
 
Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud? Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud?
DataWorks Summit
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMR
Israel AWS User Group
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Jen Aman
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)
Amazon Web Services
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of Parameters
Jen Aman
 
WebDB Forum 2016 gunosy
WebDB Forum 2016 gunosyWebDB Forum 2016 gunosy
WebDB Forum 2016 gunosy
Hiroaki Kudo
 
Data Center Migration to the AWS Cloud
Data Center Migration to the AWS CloudData Center Migration to the AWS Cloud
Data Center Migration to the AWS CloudTom Laszewski
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
Amazon Web Services
 
The AWS Big Data Platform – Overview
The AWS Big Data Platform – OverviewThe AWS Big Data Platform – Overview
The AWS Big Data Platform – Overview
Amazon Web Services
 

Viewers also liked (10)

Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?
 
Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud? Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud?
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMR
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of Parameters
 
WebDB Forum 2016 gunosy
WebDB Forum 2016 gunosyWebDB Forum 2016 gunosy
WebDB Forum 2016 gunosy
 
Data Center Migration to the AWS Cloud
Data Center Migration to the AWS CloudData Center Migration to the AWS Cloud
Data Center Migration to the AWS Cloud
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
The AWS Big Data Platform – Overview
The AWS Big Data Platform – OverviewThe AWS Big Data Platform – Overview
The AWS Big Data Platform – Overview
 

Similar to Big Data and Hadoop in Cloud - Leveraging Amazon EMR

Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
outstanding59
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
Distributed Data processing in a Cloud
Distributed Data processing in a CloudDistributed Data processing in a Cloud
Distributed Data processing in a Cloudelliando dias
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
Chris Purrington
 
Power Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudPower Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS Cloud
Edureka!
 
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Yahoo Developer Network
 
Lessons learned scaling big data in cloud
Lessons learned   scaling big data in cloudLessons learned   scaling big data in cloud
Lessons learned scaling big data in cloud
Vijay Rayapati
 
Understanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLUnderstanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQL
Hyderabad Scalability Meetup
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
samthemonad
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Etu Solution
 
Managing growth in Production Hadoop Deployments
Managing growth in Production Hadoop DeploymentsManaging growth in Production Hadoop Deployments
Managing growth in Production Hadoop Deployments
DataWorks Summit
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Sachin Aggarwal
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברגTaldor Group
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
saipriyacoool
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
Joe Alex
 
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 Improving Apache Spark by Taking Advantage of Disaggregated Architecture Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Databricks
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
Subhas Kumar Ghosh
 
TriHUG - Beyond Batch
TriHUG - Beyond BatchTriHUG - Beyond Batch
TriHUG - Beyond Batchboorad
 
Hadoop
HadoopHadoop

Similar to Big Data and Hadoop in Cloud - Leveraging Amazon EMR (20)

Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Distributed Data processing in a Cloud
Distributed Data processing in a CloudDistributed Data processing in a Cloud
Distributed Data processing in a Cloud
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Power Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudPower Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS Cloud
 
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
 
Lessons learned scaling big data in cloud
Lessons learned   scaling big data in cloudLessons learned   scaling big data in cloud
Lessons learned scaling big data in cloud
 
Understanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLUnderstanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQL
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
 
Managing growth in Production Hadoop Deployments
Managing growth in Production Hadoop DeploymentsManaging growth in Production Hadoop Deployments
Managing growth in Production Hadoop Deployments
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
 
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 Improving Apache Spark by Taking Advantage of Disaggregated Architecture Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
 
TriHUG - Beyond Batch
TriHUG - Beyond BatchTriHUG - Beyond Batch
TriHUG - Beyond Batch
 
Hadoop
HadoopHadoop
Hadoop
 

More from Vijay Rayapati

Botmetric Product Design Process
Botmetric Product Design ProcessBotmetric Product Design Process
Botmetric Product Design Process
Vijay Rayapati
 
Scalable load testing using jmeter in cloud
Scalable load testing using jmeter in cloudScalable load testing using jmeter in cloud
Scalable load testing using jmeter in cloud
Vijay Rayapati
 
Building Culture at Kuliza
Building Culture at KulizaBuilding Culture at Kuliza
Building Culture at Kuliza
Vijay Rayapati
 
Introduction to cloud computing - za garage talks
Introduction to cloud computing -  za garage talksIntroduction to cloud computing -  za garage talks
Introduction to cloud computing - za garage talks
Vijay Rayapati
 
"Introduction Open Graph and Facebook Platform" - Facebook Developer Garage ...
"Introduction Open Graph and Facebook Platform" -  Facebook Developer Garage ..."Introduction Open Graph and Facebook Platform" -  Facebook Developer Garage ...
"Introduction Open Graph and Facebook Platform" - Facebook Developer Garage ...
Vijay Rayapati
 
"Leveraging Virality aspects in Facebook Platform" -- Facebook Developer Gar...
"Leveraging Virality aspects in Facebook Platform" --  Facebook Developer Gar..."Leveraging Virality aspects in Facebook Platform" --  Facebook Developer Gar...
"Leveraging Virality aspects in Facebook Platform" -- Facebook Developer Gar...
Vijay Rayapati
 
"Smart Hiring App on Facebook" - Facebook Developer Garage Bangalore
"Smart Hiring App on Facebook"  -  Facebook Developer Garage Bangalore"Smart Hiring App on Facebook"  -  Facebook Developer Garage Bangalore
"Smart Hiring App on Facebook" - Facebook Developer Garage Bangalore
Vijay Rayapati
 
"Facebook Platform Best Practices" - Facebook Developer Garage Bangalore
"Facebook Platform Best Practices" -  Facebook Developer Garage Bangalore"Facebook Platform Best Practices" -  Facebook Developer Garage Bangalore
"Facebook Platform Best Practices" - Facebook Developer Garage Bangalore
Vijay Rayapati
 
Performance Tuning Web Apps - The Need For Speed
Performance Tuning Web Apps - The Need For SpeedPerformance Tuning Web Apps - The Need For Speed
Performance Tuning Web Apps - The Need For Speed
Vijay Rayapati
 
How Cafe Coffee Day Handled Their Online Crisis
How Cafe Coffee Day Handled Their Online CrisisHow Cafe Coffee Day Handled Their Online Crisis
How Cafe Coffee Day Handled Their Online Crisis
Vijay Rayapati
 
Giza Page Hiring
Giza Page HiringGiza Page Hiring
Giza Page Hiring
Vijay Rayapati
 
Nasscom Product Conclave 2009 - Feedback collected using Twitter
Nasscom Product Conclave 2009 - Feedback collected using TwitterNasscom Product Conclave 2009 - Feedback collected using Twitter
Nasscom Product Conclave 2009 - Feedback collected using Twitter
Vijay Rayapati
 
Social Media Engagement
Social Media EngagementSocial Media Engagement
Social Media Engagement
Vijay Rayapati
 

More from Vijay Rayapati (13)

Botmetric Product Design Process



Big Data and Hadoop in Cloud - Leveraging Amazon EMR

  • 1. Big Data and Hadoop in Cloud. Vijay Rayapati (@amnigos)
  • 3. What is Big Data?
      • "Datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing." (Wikipedia)
      • High volume of data (storage) + speed of data (scale) + variety of data (different types). (Gartner)
  • 4. World is ON = Content + Interactions = More Data (Social and Mobile)
  • 5. Tons of data is generated by each one of us! (We moved from GB to ZB and from Millions to Zillions)
  • 6. Big Data - Intelligence
  • 7. Big Data - Usefulness
  • 8. Big Data - There is so much more you can do!
  • 9. Everybody has this problem – Not just Amazon, Google, Facebook and Twitter!
  • 10. How can we work with Big Data?
  • 11. Why Cloud and Big Data?
      • Cloud has democratized access to large-scale infrastructure for the masses!
      • You can store, process and manage big data sets without worrying about IT!
      • **http://wiki.apache.org/hadoop/PoweredBy
  • 12. Hadoop – The data elephant
  • 13. Hadoop makes it easier to store, process and analyze a lot of data on commodity hardware!
  • 14. Who uses Hadoop and How? Everybody (from A to Z) to solve complex problems. **http://wiki.apache.org/hadoop/PoweredBy
  • 15. Big Data and Hadoop - It’s Fun
  • 16. Hadoop architecture (diagram)
      • Map Reduce (processing): Job Tracker and three Task Trackers
      • HDFS layer (storage): Name Node and three Data Nodes
      • Master Node
  • 17.
  • 19. Map Reduce - Explained
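The slide itself is an image, so the following is only a minimal sketch of the general Map Reduce idea, not a reconstruction of the slide. It runs the three phases (map, shuffle, reduce) in memory for a word count. The function names are hypothetical and nothing here uses Hadoop itself.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into one total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three phases on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data elephant"]))
```

In a real cluster the map and reduce functions run on many nodes and the shuffle moves data across the network; the sketch only shows how the data changes shape between phases.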
  • 20.
  • 21. Hadoop – Getting Started
      • Download the latest stable version - http://hadoop.apache.org/common/releases.html
      • Install Java ( > 1.6.0_20 ) and set your JAVA_HOME
      • Install rsync and ssh
      • Follow the instructions - http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html
      • Hadoop modes – Local, Pseudo-distributed and Fully distributed
      • Run in pseudo-distributed mode for your testing and development
      • Assign a decent JVM heap size through mapred.child.java.opts if you notice task errors, GC overhead or OOM
      • Play with the samples – WordCount, TeraSort etc.
      • Good for learning - http://www.cloudera.com/hadoop-training-virtual-machine
  • 22. Why Amazon EMR? I am interested in using Hadoop to solve problems and not in building and managing Hadoop Infrastructure!
  • 23. Amazon EMR – Setup
      • Install Ruby 1.8.X and use the EMR Ruby CLI for managing EMR.
      • Create a credentials.json file in your EMR Ruby CLI installation directory and provide your access key and private key.
      • Bootstrapping is a great way to install required components or perform custom actions in your EMR cluster.
      • A default bootstrap action is available to control the configuration of Hadoop and MapReduce.
      • Bootstrap with Ganglia during your development and tuning phase – it provides monitoring metrics across your cluster.
      • There are minor bugs in the EMR Ruby CLI, but it is pretty cool for your needs.
  • 24. Amazon EMR – Setup
      • Launching a fully configured 500-node cluster is as simple as firing one command:
        > elastic-mapreduce --create --alive --plain-output \
            --master-instance-type m1.xlarge --slave-instance-type m2.2xlarge \
            --num-instances 500 --name "Site Analytics Cluster" \
            --bootstrap-action s3://com.bcb11.emr/scripts/bootstrap-custom.sh \
            --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia \
            --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
            --args "--mapred-config-file,s3://com.bcb11.emr/conf/custom-mapred-site.xml"
        > elastic-mapreduce -j ${jobflow} --stream --step-name "Profile Analyzer" \
            --jobconf mapred.task.timeout=0 \
            --mapper s3://com.bcb11.emr/code/mapper.rb \
            --reducer s3://com.bcb11.emr/bin/reducer.rb \
            --cache s3://com.bcb11.emr/cache/customdata.dat#data.txt \
            --input s3://com.bcb11.emr/input/ --output s3://com.bcb11.emr/output
  • 25. Amazon EMR - Service Architecture
  • 26. EMR CLI – What you need to know?
      • elastic-mapreduce -j <jobflow id> --describe
      • elastic-mapreduce --list --active
      • elastic-mapreduce -j <jobflow id> --terminate
      • elastic-mapreduce --jobflow <jobflow id> --ssh
      • Look into your logs directory in S3 if you need any other information on cluster setup, Hadoop logs, job step logs, task attempt logs etc.
  • 27. EMR Map Reduce Jobs
      • Amazon EMR supports streaming, custom jar, Cascading, Pig and Hive, so you can write jobs the way you want without worrying about managing the underlying infrastructure, including Hadoop.
      • Streaming – write Map Reduce jobs in any scripting language.
      • Custom Jar – write in Java; good for speed and control.
      • Cascading, Hive and Pig – a higher level of abstraction.
      • Use a good S3 explorer, FoxyProxy and ElasticFox.
      • Leverage the AWS EMR forum if you need help.
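To illustrate the streaming option above, here is a hedged sketch of a word-count mapper and reducer written for the Hadoop Streaming convention: each record is a line of text, keys and values are separated by a tab, and the reducer receives its input sorted by key. The deck's own jobs were written in Ruby; Python is used here only as an example of "any scripting language", and the names `mapper`, `reducer` and `run` are hypothetical.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive records that share a key.

    Assumes the input is already sorted by key, which the framework
    guarantees for reducer input.
    """
    records = (
        line.rstrip("\n").split("\t", 1)
        for line in sorted_lines
        if "\t" in line  # skip blank or malformed lines
    )
    for key, group in groupby(records, key=lambda record: record[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


def run(role, stream_in, stream_out):
    """Wire one role to text streams; a real script would pass sys.stdin and sys.stdout."""
    step = mapper if role == "map" else reducer
    for record in step(stream_in):
        stream_out.write(record + "\n")
```

Locally, the same pipeline can be approximated by feeding the mapper output through a sort before the reducer, which is a cheap way to test logic before paying for a cluster.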
  • 28. EMR – Debugging and Performance Tuning
  • 29. Hadoop – Debugging and Profiling
      • Run Hadoop in local mode for debugging so mapper and reducer tasks run in a single JVM instead of separate JVMs.
      • Configure HADOOP_OPTS to enable debugging: export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8008"
      • Change the fs.default.name value in core-site.xml from hdfs:// to file:///
      • Set the mapred.job.tracker value in mapred-site.xml to local
      • Create a debug configuration for Eclipse and set the port to 8008.
      • Run your Hadoop job and launch Eclipse with your Java code so you can start debugging.
      • Use your favorite profiler to understand code-level hotspots.
  • 30. EMR – Good, Bad and Ugly
      • Great for bootstrapping large clusters and very cost-effective if you only need infrastructure once in a while to run your Hadoop jobs.
      • No need to worry about underlying Hadoop cluster setup and management. Most patches are applied and Amazon creates new AMIs with improvements.
      • Doesn’t have a fallback (secondary name node); there is only one master node.
      • Intermittent network issues can sometimes cause serious degradation of performance.
      • Network IO is variable, and streaming jobs will be much more sluggish on EMR compared to a dedicated setup.
      • Disk IO is terrible across instance families and types – please fix it.
  • 31. Hadoop – High Level Tuning
      • Small files problem – avoid too many small files and tune your block size.
      • Tune your settings – JVM Reuse, Sort Buffer, Sort Factor, Map/Reduce Tasks, Parallel Copies, MapRed Output Compression etc.
      • The good thing is that you can use a small cluster and a sample input size for tuning.
      • Know what is limiting you at a node level – CPU, Memory, Disk IO or Network IN/OUT.
  • 32. Hadoop – What affects your job's performance?
      • GC Overhead - memory and reduce the JVM reuse tasks.
      • Increase the dfs block size (default 128MB in EMR) for large files.
      • Avoid read contention at S3 – have equal or more files in S3 compared to available mappers.
      • Use mapred output compression to save storage, processing time and bandwidth costs.
      • Set the mapred task timeout to 0 if you have long-running jobs (> 10 mins); you can also disable speculative execution.
      • Increase the sort buffer and sort factor based on the map tasks' output.
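The small-files and block-size advice can be made concrete with some rough arithmetic. The sketch below assumes a simplified model in which every input file is cut into one map task per block and every file yields at least one task. Real Hadoop input formats apply extra rules, and S3-backed inputs may split differently, so treat the numbers as estimates only. The helper name `estimated_map_tasks` is hypothetical.

```python
import math

MB = 1024 * 1024


def estimated_map_tasks(file_sizes_bytes, block_size_bytes=128 * MB):
    """Estimate map tasks as one task per block, with at least one task per file.

    Simplified assumption: ignores split slop, compression and
    format-specific behaviour.
    """
    return sum(
        max(1, math.ceil(size / block_size_bytes)) for size in file_sizes_bytes
    )


# One large 10 GB file versus the same volume spread over 10,240 files of 1 MB.
one_big_file = estimated_map_tasks([10 * 1024 * MB])
many_small_files = estimated_map_tasks([1 * MB] * 10240)

# Doubling the block size halves the task count for the large file.
bigger_blocks = estimated_map_tasks([10 * 1024 * MB], block_size_bytes=256 * MB)
```

Under this model the same 10 GB of input needs 80 tasks as one file but 10,240 tasks as small files, which is why many tiny inputs inflate scheduling overhead.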
  • 33. Understand – EMR Cluster Metrics
  • 34. Understand – EMR Cluster Metrics
  • 35. Common Bottlenecks – Monitor Matters
  • 36. Hadoop and EMR – What have I learned?
      • Code is god – if you have severe performance issues then look at your code 100 times, understand the third-party libraries used and rewrite in Java if required.
      • Streaming jobs are slow compared to Custom Jar jobs because of overhead; scripting is good for ad hoc analysis.
      • Disk IO and Network IO affect your processing time.
      • Be ready to face variable performance in the cloud.
      • Monitor everything once in a while and keep benchmarking with data points.
      • Default settings are seldom optimal in EMR – unless you run simple jobs.
      • Focus on optimization as it’s the only way to save cost and time.
  • 37. Hadoop and EMR – Performance Tuning Example
      • Streaming: Map Reduce jobs were written using Ruby. Input dataset was 150 GB and output was around 4000 GB. Complex processing, highly CPU bound and Disk IO.
      • Time taken to complete job processing: 4000 m1.xlarge nodes and 180 minutes.
      • Rewrote the code in Java – job processing time was reduced to 70 minutes on just 400 m1.xlarge nodes.
      • Tuning the EMR configuration further reduced it to 32 minutes.
      • Focus on code first and then focus on configuration.
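A quick way to compare the three runs above is in node-minutes (nodes multiplied by wall-clock minutes). The slide does not state the cluster size for the final 32-minute run, so the sketch assumes it used the same 400 nodes as the Java rewrite. That is an assumption, not a figure from the deck.

```python
def node_minutes(nodes, minutes):
    """Total machine time consumed by a run, in node-minutes."""
    return nodes * minutes


ruby_streaming = node_minutes(4000, 180)  # figures from the slide
java_rewrite = node_minutes(400, 70)      # figures from the slide
java_tuned = node_minutes(400, 32)        # ASSUMPTION: still 400 nodes

# How many times less machine time each step needs than the Ruby baseline.
rewrite_gain = ruby_streaming / java_rewrite
tuned_gain = ruby_streaming / java_tuned

if __name__ == "__main__":
    print(f"Ruby streaming: {ruby_streaming} node-minutes")
    print(f"Java rewrite:   {java_rewrite} node-minutes ({rewrite_gain:.1f}x less)")
    print(f"Java + tuning:  {java_tuned} node-minutes ({tuned_gain:.1f}x less)")
```

On these numbers the rewrite accounts for most of the saving (roughly 26x), with configuration tuning roughly doubling it again, which matches the slide's advice to focus on code first.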
  • 38. Q&A
  • 39. Like what we do? Connect with me: Kuliza.com | vijay.rayapati@kuliza.com | @kuliza | @amnigos

Editor's Notes

  1. http://stayinfront.com/Portals/0/SIF_Analytics_graphs.jpg http://www.hiero.com/images/web-analytics.jpg
  2. http://www.thriveanalytics.com/Thrive%20Analytics%20Data%20Analysis.jpg
  3. http://siliconangle.com/files/2011/11/love-big-data-300x300.jpg
  4. http://pictures.brafton.com/liveimages/Approach-of-big-data-means-cloud-computing-remains-popular_16000823_800695886_0_0_14029890_300.jpg
  5. http://z4webhosting.com/blog/wp-content/uploads/2011/10/325da1195fephant.jpg.jpg
  6. http://www.thriveanalytics.com/Thrive%20Analytics%20Data%20Analysis.jpg
  7. http://www.thriveanalytics.com/Thrive%20Analytics%20Data%20Analysis.jpg
  8. http://www.thriveanalytics.com/Thrive%20Analytics%20Data%20Analysis.jpg
  9. http://www.thriveanalytics.com/Thrive%20Analytics%20Data%20Analysis.jpg
  10. http://www.cloverquotes.com/images/picture_quotes/0/20_main.jpg?1321395710
  11. http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/Introduction_EMRArch.html