Big Data and Hadoop in Cloud - Leveraging Amazon EMR



I did a talk on "Big Data and Hadoop in Cloud" at Barcamp Bangalore.

Published in: Technology
  • Awesome Presentation.

  • @sureshsambandam True that consumers won't mind whether it's MySQL or Cassandra that powers their data. But in my opinion, when tech people start a company, it's better to choose the most scalable, developer-friendly open option. The reason being: when you choose MySQL, all your schemas are based on columns, which makes it a bit difficult to change all the queries in your code in case you want to switch to a NoSQL store, for example.
    And it's good to know that you are based in Chennai :)
  • Sir,
    I'm doing my final year engineering project on cloud computing, and I'm facing some difficulties in installing Hadoop according to the slides shown at Barcamp.
  • Thanks Suresh... I agree - code is god. One of the points I mentioned while describing why you should use Amazon EMR instead of building your own Hadoop clusters was to focus on solving your business problem, not on infrastructure :)
  • Very good presentation Vijay.

    But the real problem for startups is not choosing the right technology when you are starting out, so that it will scale when growth happens, but actually addressing the growth. Especially in the consumer space, no one cares whether you use Big Data or MySQL.

    On the other hand, if teams can start out with whatever technology they are already familiar with, without wasting time on a new technology, and then rewrite the product/platform as they go when the growth happens - that is the best thing for the business. That is called in-flight technology upgrade. While I understand this may not be the right approach for all cases, for most end-user apps it would hold true. My view is: hit the first 100,000 users. PHP may not be the best programming language - but Facebook users don't care!
  • Big Data and Hadoop in Cloud - Leveraging Amazon EMR

    1. Big Data and Hadoop in Cloud - Vijay Rayapati @amnigos
    2. Follow Barcamp Rules!
    3. What is Big Data? "Datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing" - Wikipedia. High volume of data (storage) + speed of data (scale) + variety of data (diff types) - Gartner
    4. World is ON = Content + Interactions = More Data (Social and Mobile)
    5. Tons of data is generated by each one of us! (We moved from GB to ZB and from Millions to Zillions)
    6. Big Data - Intelligence
    7. Big Data - Usefulness
    8. Big Data - There is so much more you can do!
    9. Everybody has this problem – Not just Amazon, Google, Facebook and Twitter!
    10. How can we work with Big Data?
    11. Why Cloud and Big Data? Cloud has democratized access to large scale infrastructure for the masses! You can store, process and manage big data sets without worrying about IT! **
    12. Hadoop – The data elephant
    13. Hadoop makes it easier to store, process and analyze lots of data on commodity hardware!
    14. Who uses Hadoop and How? Everybody (from A to Z) to solve complex problems **
    15. Big Data and Hadoop - It's Fun
    16. [Architecture diagram] Map Reduce layer (processing): a Job Tracker on the master node and Task Trackers on the slave nodes. HDFS layer (storage): a Name Node on the master node and Data Nodes on the slave nodes.
    17. Map Reduce Paradigm
    18. Map Reduce - Explained
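    The paradigm on these two slides can be sketched as a Hadoop Streaming style word count. The sketch below is mine, not from the deck, and uses Ruby since the deck's own streaming examples (mapper.rb / reducer.rb) do; the shuffle function stands in for Hadoop's sort/shuffle phase between the mapper and reducer.

```ruby
# Word count in the Hadoop Streaming style: the mapper turns raw lines
# into (word, 1) pairs; Hadoop sorts mapper output by key before the
# reducer sees it, so the reducer only has to sum runs of equal keys.

def map_line(line)
  line.downcase.scan(/[a-z']+/).map { |word| [word, 1] }
end

# Stands in for the shuffle phase: deliver the mapper's pairs grouped
# by sorted key, exactly as a streaming reducer would receive them.
def shuffle(pairs)
  pairs.sort_by { |key, _| key }
end

def reduce(sorted_pairs)
  counts = Hash.new(0)
  sorted_pairs.each { |word, n| counts[word] += n }
  counts
end

if __FILE__ == $0
  lines = ["big data and hadoop", "hadoop in the cloud"]
  pairs = lines.flat_map { |l| map_line(l) }
  puts reduce(shuffle(pairs)).inspect
end
```

    In real Hadoop Streaming the mapper and reducer are separate scripts reading stdin and writing tab-separated key/value lines, but the contract is the same: the reducer can rely on its input arriving grouped by key.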
    19. Hadoop – Getting Started
       • Download the latest stable version -
       • Install Java ( > 1.6.0_20 ) and set your JAVA_HOME
       • Install rsync and ssh
       • Follow instructions -
       • Hadoop Modes – Local, Pseudo-distributed and Fully distributed
       • Run in pseudo-distributed mode for your testing and development
       • Assign a decent JVM heap size if you notice task errors, GC overhead or OOM
       • Play with samples – WordCount, TeraSort etc.
       • Good for learning -
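    For reference, the classic pseudo-distributed setup on the Hadoop 0.20/1.x line (the versions contemporary with this talk) needs only three small config files; the property names are from the stock Hadoop single-node setup docs, and the ports shown are the conventional defaults, not requirements:

```xml
<!-- conf/core-site.xml : point the default filesystem at a local HDFS -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml : single node, so keep one replica per block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml : run the JobTracker on this machine -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```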
    20. Why Amazon EMR? I am interested in using Hadoop to solve problems and not in building and managing Hadoop infrastructure!
    21. Amazon EMR – Setup
       • Install Ruby 1.8.X and use the EMR Ruby CLI for managing EMR.
       • Just create a credentials.json file in your EMR Ruby CLI installation directory and provide your access key & private key.
       • Bootstrapping is a great way to install required components or perform custom actions in your EMR cluster.
       • A default bootstrap action is available to control the configuration of Hadoop and MapReduce.
       • Bootstrap with Ganglia during your development and tuning phase – it provides monitoring metrics across your cluster.
       • Minor bugs in the EMR Ruby CLI, but pretty cool for your needs.
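    A credentials.json for the EMR Ruby CLI looks roughly like the sketch below. The field names are from memory of the elastic-mapreduce CLI README of that era, so verify them against your CLI version's documentation; the values are placeholders:

```json
{
  "access_id": "<your AWS access key id>",
  "private_key": "<your AWS secret access key>",
  "keypair": "<your EC2 key pair name>",
  "key-pair-file": "<path to the key pair .pem file>",
  "log_uri": "s3n://<your-bucket>/emr-logs/",
  "region": "us-east-1"
}
```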
    22. Amazon EMR – Setup
       • Launching a 500 node and fully configured cluster is as simple as firing one command:
         > elastic-mapreduce --create --alive --plain-output \
             --master-instance-type m1.xlarge --slave-instance-type m2.2xlarge \
             --num-instances 500 --name "Site Analytics Cluster" \
             --bootstrap-action s3://com.bcb11.emr/scripts/ \
             --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia \
             --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
             --args "--mapred-config-file,s3://com.bcb11.emr/conf/custom-mapred-site.xml"
         > elastic-mapreduce -j ${jobflow} --stream --step-name "Profile Analyzer" \
             --jobconf mapred.task.timeout=0 \
             --mapper s3://com.bcb11.emr/code/mapper.rb \
             --reducer s3://com.bcb11.emr/bin/reducer.rb \
             --cache s3://com.bcb11.emr/cache/customdata.dat#data.txt \
             --input s3://com.bcb11.emr/input/ --output s3://com.bcb11.emr/output
    23. Amazon EMR - Service Architecture
    24. EMR CLI – What you need to know?
       • elastic-mapreduce -j <jobflow id> --describe
       • elastic-mapreduce --list --active
       • elastic-mapreduce -j <jobflow id> --terminate
       • elastic-mapreduce --jobflow <jobflow id> --ssh
       • Look into your logs directory in S3 if you need any other information: cluster setup, Hadoop logs, job step logs, task attempt logs etc.
    25. EMR Map Reduce Jobs
       • Amazon EMR supports streaming, custom JAR, Cascading, Pig and Hive, so you can write jobs in any way you want without worrying about managing the underlying infrastructure, including Hadoop.
       • Streaming – write Map Reduce jobs in any scripting language.
       • Custom JAR – write in Java; good for speed/control.
       • Cascading, Hive and Pig – a higher level of abstraction.
       • Use a good S3 explorer, FoxyProxy and ElasticFox.
       • Leverage the AWS EMR forum if you need help.
    26. EMR – Debugging and Performance Tuning
    27. Hadoop – Debugging and Profiling
       • Run Hadoop in local mode for debugging so mapper and reducer tasks run in a single JVM instead of separate JVMs.
       • Configure HADOOP_OPTS to enable debugging: export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8008"
       • Configure the value in core-site.xml from hdfs:// to file:///
       • Configure the mapred.job.tracker value in mapred-site.xml to local
       • Create a debug configuration for Eclipse and set the port to 8008.
       • Run your Hadoop job and launch Eclipse with your Java code so you can start debugging.
       • Use your favorite profiler to understand code-level hotspots.
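    The two config changes described above look like this on Hadoop 1.x. The slide doesn't name the core-site.xml property; fs.default.name is my assumption, since that is the property that selects the default filesystem on that Hadoop line:

```xml
<!-- conf/core-site.xml : use the local filesystem instead of HDFS -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml : run jobs in-process, with no JobTracker daemon -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>
</configuration>
```

    With both set to local mode, the whole job runs in one JVM, which is what lets the single Eclipse debug session on port 8008 see every map and reduce task.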
    28. EMR – Good, Bad and Ugly
       • Great for bootstrapping large clusters and very cost-effective if you need once-in-a-while infrastructure to run your Hadoop jobs.
       • You don't need to worry about the underlying Hadoop cluster setup and management. Most patches are applied and Amazon creates new AMIs with improvements.
       • Doesn't have a fallback (secondary name node) – only one master node.
       • Intermittent network issues – sometimes could cause serious degradation of performance.
       • Network IO is variable, and streaming jobs will be much more sluggish on EMR compared to a dedicated setup.
       • Disk IO is terrible across instance families and types – please fix it.
    29. Hadoop – High Level Tuning
       • Small files problem – avoid too many small files and tune your block size.
       • Tune your settings – JVM reuse, sort buffer, sort factor, map/reduce tasks, parallel copies, MapRed output compression etc.
       • Know what is limiting you at a node level – CPU, memory, disk IO or network IN/OUT.
       • The good thing is that you can use a small cluster and a sample input size for tuning.
    30. Hadoop – What affects your job's performance?
       • GC overhead – increase memory and reduce the JVM reuse tasks.
       • Increase the dfs block size (default 128MB in EMR) for large files.
       • Avoid read contention at S3 – have equal or more files in S3 compared to available mappers.
       • Use mapred output compression to save storage, processing time and bandwidth costs.
       • Set the mapred task timeout to 0 if you have long running jobs (> 10 mins) and can disable speculative execution.
       • Increase the sort buffer and sort factor based on map task output.
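    These knobs correspond to concrete Hadoop 1.x-era properties, the kind of thing that would go into the custom mapred-site.xml passed to the configure-hadoop bootstrap action earlier. The property names below are the standard ones for that Hadoop line; the values are illustrative assumptions for a tuning experiment, not recommendations:

```xml
<configuration>
  <!-- Reuse each task JVM for 20 tasks (-1 = unlimited, 1 = no reuse) -->
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>20</value>
  </property>
  <!-- Compress job output to save S3 storage, bandwidth and processing time -->
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
  </property>
  <!-- Map-side sort buffer in MB -->
  <property>
    <name>io.sort.mb</name>
    <value>200</value>
  </property>
  <!-- How many streams to merge at once while sorting -->
  <property>
    <name>io.sort.factor</name>
    <value>48</value>
  </property>
  <!-- 0 disables the per-task timeout, for long-running tasks -->
  <property>
    <name>mapred.task.timeout</name>
    <value>0</value>
  </property>
</configuration>
```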
    31. Understand – EMR Cluster Metrics
    32. Understand – EMR Cluster Metrics
    33. Common Bottlenecks – Monitoring Matters
    34. Hadoop and EMR – What have I learned?
       • Code is god – if you have severe performance issues, then look at your code 100 times, understand the third-party libraries used, and rewrite in Java if required.
       • Streaming jobs are slow compared to custom JAR jobs due to overhead – but scripting is good for ad-hoc analysis.
       • Disk IO and network IO affect your processing time.
       • Be ready to face variable performance in the Cloud.
       • Monitor everything once in a while and keep benchmarking with data points.
       • Default settings are seldom optimal in EMR – unless you run simple jobs.
       • Focus on optimization, as it's the only way to save cost and time.
    35. Hadoop and EMR – Performance Tuning Example
       • Streaming: map reduce jobs were written in Ruby. The input dataset was 150 GB and the output was around 4000 GB. Complex processing, highly CPU-bound and disk IO-bound.
       • Time taken to complete job processing: 4000 m1.xlarge nodes and 180 minutes.
       • Rewrote the code in Java – job processing time was reduced to 70 minutes on just 400 m1.xlarge nodes.
       • Tuning the EMR configuration further reduced it to 32 minutes.
       • Focus on code first and then on configuration.
    36. Q&A
    37. Like what we do? – connect with me | | @kuliza @amnigos