BigDataCloud meetup - July 8th - Cost effective big-data processing using Amazon EMR- Presentation by Sujee

4,504 views

Published in: Technology, Business
  1. ‘Amazon EMR’ coming up… by Sujee Maniyam
  2. Big Data Cloud Meetup
     Cost Effective Big-Data Processing using Amazon Elastic Map Reduce
     Sujee Maniyam
     hello@sujee.net | www.sujee.net
     July 08, 2011
  3. Cost Effective Big-Data Processing using Amazon Elastic Map Reduce
     Sujee Maniyam
     http://sujee.net
     hello@sujee.net
  4. Quiz
     PRIZE!
     Where was this picture taken?
  5. Quiz: Where was this picture taken?
  6. Answer: Montara Light House
  7. Hi, I’m Sujee
     10+ years of software development
     enterprise apps → web apps → iphone apps → Hadoop
     Hands-on experience with Hadoop / HBase / Amazon ‘cloud’
     More: http://sujee.net/tech
  8. I am an ‘expert’
  9. Ah.. Data
  10. Nature of Data…
     Primary data
       Email, blogs, pictures, tweets
       Critical for operation (Gmail can’t lose emails)
     Secondary data
       Wikipedia access logs, Google search logs
       Not ‘critical’, but used to ‘enhance’ the user experience
       Search logs help predict ‘trends’
       Yelp can figure out you like Chinese food
  11. Data Explosion
     Primary data has grown phenomenally
     But secondary data has exploded in recent years
     “Log everything and ask questions later”
     Used for:
       Recommendations (books, restaurants, etc.)
       Predicting trends (job skills in demand)
       Showing ADS ($$$)
     ‘Big Data’ is no longer just a problem for the big guys (Google / Facebook)
     Startups are struggling to get on top of ‘big data’
  12. Hadoop to the Rescue
     Hadoop can help with Big Data
     Hadoop has been proven in the field
     Under active development
     Throw hardware at the problem
       Getting cheaper by the year
     Bleeding-edge technology
       Hire good people!
  13. Hadoop: It is a CAREER
  14. Data Spectrum
  15. Who is Using Hadoop?
  16. Big Guys
  17. Startups
  18. Startups and Big Data
  19. About This Presentation
     Based on my experience with a startup
       5 people (3 engineers)
       Ad-serving space
       Amazon EC2 is our ‘data center’
     Technologies:
       Web stack: Python, Tornado, PHP, mysql, LAMP
       Amazon EMR to crunch data
     Data size: 1 TB / week
  20. Story of a Startup… month 1
     Each web server writes logs locally
     Logs were copied to a log server and purged from web servers
     Log data size: ~100-200 G
  21. Story of a Startup… month 6
     More web servers come online
     Aggregate log server falls behind
  22. Data @ 6 Months
     2 TB of data already
     50-100 G of new data / day
     And we were operating at 20% of our capacity!
  23. Future…
  24. Solution?
     Scalable database (NoSQL)
       HBase
       Cassandra
     Hadoop log processing / Map Reduce
  25. What We Evaluated
     1) HBase cluster
     2) Hadoop cluster
     3) Amazon EMR
  26. Hadoop on Amazon EC2
     1) Permanent cluster
     2) On-demand cluster (Elastic Map Reduce)
  27. 1) Permanent Hadoop Cluster
  28. Architecture 1
  29. Hadoop Cluster
     7 c1.xlarge machines
     15 TB of EBS volumes
     Sqoop exports mysql log tables into HDFS
     Logs are compressed (gz) to minimize disk usage (data-locality trade-off)
     All is working well…
  30. Lessons Learned
     c1.xlarge is pretty stable (8 cores / 8 G memory)
     EBS volumes:
       Max size is 1 TB, so string a few together for higher density per node
       DON’T RAID them; let Hadoop handle them as individual disks
     ?? : Skip EBS. Use instance-store disks, and store data in S3
  31. Amazon Storage Options
  32. 2 Months Later
     A couple of EBS volumes DIE
     A couple of EC2 instances DIE
     Maintaining the Hadoop cluster is a mechanical job: less appealing
     COST!
       Our job utilization is about 50%
       But we are still paying for machines running 24x7
  33. Amazon EC2 Cost
  34. Hadoop Cluster on EC2 Cost
     $3,500 = 7 c1.xlarge @ $500 / month
     $1,500 = 15 TB EBS storage @ $0.10 per GB
     $  500 = EBS I/O requests @ $0.10 per 1 million I/O requests
     = $5,500 / month
     ≈ $66,000 / year!
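The monthly figure above can be checked with a quick calculation, using the per-unit prices quoted on the slide (2011-era pricing; actual EC2/EBS rates vary by region and have changed since):

```python
# Rough monthly cost of the permanent 7-node Hadoop cluster on EC2,
# using the unit prices from the slide.
instances = 7 * 500.0          # 7 c1.xlarge @ ~$500/month each
ebs_storage = 15_000 * 0.10    # 15 TB = 15,000 GB @ $0.10 per GB-month
ebs_io = 500.0                 # ~5 billion I/O requests @ $0.10 per million

monthly = instances + ebs_storage + ebs_io
print(monthly)        # 5500.0
print(monthly * 12)   # 66000.0
```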
  35. Buy / Rent?
     Typical Hadoop machine cost: $10k
     10-node cluster = $100k
       Plus data-center costs
       Plus IT-ops costs
     Amazon EC2 10-node cluster:
       $500 * 10 = $5,000 / month = $60k / year
  36. Buy / Rent
     Amazon EC2 is great for:
       Quickly getting started
       Startups
       Scaling on demand / rapidly adding more servers
         popular social games
     Netflix story:
       Streaming is powered by EC2
       Encoding movies, etc.
       Uses 1000s of instances
     Not so economical for running clusters 24x7
  37. Next: Amazon EMR
  38. Where was this picture taken?
  39. Answer: Pacifica Pier
  40. Amazon’s solution: Elastic Map Reduce
     Store data on Amazon S3
     Kick off a Hadoop cluster to process the data
     Shut down when done
     Pay for the HOURS used
  41. Architecture: Amazon EMR
  42. Moving Parts
     Logs go into Scribe
     Scribe master ships logs into S3, gzipped
     Spin up an EMR cluster, run the job, done
     Using the same old Java MR jobs on EMR
     Summary data gets updated directly into mysql
  43. EMR Launch Scripts
     Scripts to launch jar EMR jobs
       Custom parameters depending on job needs (instance types, size of cluster, etc.)
       Monitor job progress
       Save logs for later inspection
       Job status (finished / cancelled)
     https://github.com/sujee/amazon-emr-beyond-basics
  44. Sample Launch Script
     #!/bin/bash
     ## run-sitestats4.sh

     # config
     MASTER_INSTANCE_TYPE="m1.large"
     SLAVE_INSTANCE_TYPE="c1.xlarge"
     INSTANCES=5
     export JOBNAME="SiteStats4"
     export TIMESTAMP=$(date +%Y%m%d-%H%M%S)
     # end config

     echo "==========================================="
     echo $(date +%Y%m%d.%H%M%S) " > $0 : starting...."
     export t1=$(date +%s)
     export JOBID=$(elastic-mapreduce --plain-output --create --name "${JOBNAME}__${TIMESTAMP}" --num-instances "$INSTANCES" --master-instance-type "$MASTER_INSTANCE_TYPE" --slave-instance-type "$SLAVE_INSTANCE_TYPE" --jar s3://my_bucket/jars/adp.jar --main-class com.adpredictive.hadoop.mr.SiteStats4 --arg s3://my_bucket/jars/sitestats4-prod.config --log-uri s3://my_bucket/emr-logs/ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "--core-config-file,s3://my_bucket/jars/core-site.xml,--mapred-config-file,s3://my_bucket/jars/mapred-site.xml")

     sh ./emr-wait-for-completion.sh
  45. mapred-config-m1-xl.xml
     <?xml version="1.0"?>
     <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
     <configuration>
       <property>
         <name>mapreduce.map.java.opts</name>
         <value>-Xmx1024M</value>
       </property>
       <property>
         <name>mapreduce.reduce.java.opts</name>
         <value>-Xmx3000M</value>
       </property>
       <property>
         <name>mapred.tasktracker.reduce.tasks.maximum</name>
         <value>3</value>
         <description>4 is running out of memory</description>
       </property>
       <property>
         <name>mapred.output.compress</name>
         <value>true</value>
       </property>
       <property>
         <name>mapred.output.compression.type</name>
         <value>BLOCK</value>
       </property>
     </configuration>
  46. emr-wait-for-completion.sh
     Polls for job status periodically
     Saves the logs
     Calculates the job run time
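The emr-wait-for-completion.sh script itself is not shown on the slide; a minimal sketch of the same polling logic might look like this (in Python for brevity; `get_job_state`, the state names, and the job-flow id are assumptions standing in for shelling out to something like `elastic-mapreduce --describe`):

```python
import time

def wait_for_completion(jobid, get_job_state, poll_secs=30):
    """Poll an EMR job flow until it reaches a terminal state.

    get_job_state is injected (a stand-in for calling the
    elastic-mapreduce CLI and extracting the state field),
    which keeps the loop testable. Returns (state, elapsed seconds).
    """
    start = time.time()
    terminal = {"COMPLETED", "FAILED", "TERMINATED"}
    while True:
        state = get_job_state(jobid)
        if state in terminal:
            return state, time.time() - start
        time.sleep(poll_secs)

# Example with a canned sequence of states instead of the real CLI:
states = iter(["STARTING", "RUNNING", "RUNNING", "COMPLETED"])
state, elapsed = wait_for_completion("j-XYZ", lambda _: next(states), poll_secs=0)
print(state)  # COMPLETED
```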
  47. Saved Logs
  48. Sample Saved Log
  49. Data Joining (x-ref)
     Data is split across log files; need to x-ref during the Map phase
     We used to load the data into the mapper’s memory (data was small and lived in mysql)
     Now we use Membase (memcached protocol)
     Two MR jobs are chained:
       The first processes logfile_type_A and populates Membase (very quick, takes minutes)
       The second processes logfile_type_B, cross-referencing values from Membase
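The chained x-ref can be sketched with a plain dict standing in for Membase (the field names and record shapes below are invented for illustration; the real jobs are Java MR and the store is a networked Membase cluster):

```python
# "Job 1" scans type-A records and populates the shared lookup store.
# "Job 2" maps over type-B records and cross-references the store.
# A dict stands in for Membase/memcached.

type_a_logs = [
    {"campaign_id": 7, "campaign_name": "spring-sale"},
    {"campaign_id": 9, "campaign_name": "retargeting"},
]
type_b_logs = [
    {"event": "click", "campaign_id": 7},
    {"event": "click", "campaign_id": 9},
]

# "Job 1": build the lookup table keyed by campaign_id
store = {rec["campaign_id"]: rec["campaign_name"] for rec in type_a_logs}

# "Job 2": enrich each type-B record during the map phase
joined = [dict(rec, campaign_name=store[rec["campaign_id"]])
          for rec in type_b_logs]
print(joined[0]["campaign_name"])  # spring-sale
```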
  50. X-ref
  51. EMR Wins
     Cost: only pay for use
       http://aws.amazon.com/elasticmapreduce/pricing/
     Example: EMR run on 5 c1.xlarge for 3 hrs
       EC2 instances for 3 hrs = $0.68/hr x 5 instances x 3 hrs = $10.20
       http://aws.amazon.com/elasticmapreduce/faqs/#billing-4
       (1 hour of c1.xlarge = 8 hours of normalized compute time)
       EMR cost = 5 instances x 3 hrs x 8 normalized hrs x $0.12 = $14.40
     Plus S3 storage cost: 1 TB / month = $150
     Data bandwidth from S3 to EC2 is FREE!
     ≈ $25 for the run
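The per-run arithmetic above, checked (prices exactly as quoted on the slide):

```python
# Cost of one 3-hour EMR run on 5 c1.xlarge, per the slide's 2011 prices.
ec2 = 0.68 * 5 * 3      # on-demand EC2 instance-hours
emr = 5 * 3 * 8 * 0.12  # EMR fee on normalized compute hours
total = ec2 + emr
print(round(ec2, 2), round(emr, 2), round(total, 2))  # 10.2 14.4 24.6
```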
  52. EMR Wins
     No Hadoop cluster to maintain: no failed nodes / disks
     Bonus: can tailor the cluster to each job
       smaller jobs → fewer machines
       memory-hungry tasks → m1.xlarge
       cpu-hungry tasks → c1.xlarge
  53. Design Wins
     Bidders now write logs to Scribe directly
       No mysql on the web-server machines
       Writes are much faster!
     S3 has been reliable and cheap storage
  54. Next: Lessons Learned
  55. Where was this pic taken?
  56. Answer: Foster City
  57. Lessons Learned: Logfile Format
     CSV → JSON
     Started with CSV:
       CSV: "2","26","3","07807606-7637-41c0-9bc0-8d392ac73b42","MTY4Mjk2NDk0eDAuNDk4IDEyODQwMTkyMDB4LTM0MTk3OTg2Ng","2010-09-09 03:59:56:000 EDT","70.68.3.116","908105","http://housemdvideos.com/seasons/video.php?s=01&e=07","908105","160x600","performance","25","ca","housemdvideos.com","1","1.2840192E9","0","221","0.60000","NULL","NULL
       20-40 fields… fragile, position-dependent, hard to code
       url = csv[18] … counting position numbers gets old after the 100th time
       If (csv.length == 29) url = csv[28] else url = csv[26]
     JSON: { exchange_id: 2, url: "http://housemdvideos.com/seasons/video.php?s=01&e=07" … }
       Self-describing, easy to add new fields, easy to process
       url = map.get('url')
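The fragility argument, in Python terms (the CSV field layout here is shortened and invented to mirror the slide's example):

```python
import csv
import io
import json

# Positional CSV: the consumer must know that a given column is the URL,
# and any schema change silently shifts every index after it.
line = ('"2","26","3","abc","2010-09-09","70.68.3.116","908105",'
        '"http://housemdvideos.com/seasons/video.php?s=01&e=07"')
row = next(csv.reader(io.StringIO(line)))
url_csv = row[7]  # breaks if a field is ever inserted before it

# Self-describing JSON: the consumer asks for the field by name.
record = json.loads('{"exchange_id": 2, '
                    '"url": "http://housemdvideos.com/seasons/video.php?s=01&e=07"}')
url_json = record.get("url")

print(url_csv == url_json)  # True
```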
  58. Lessons Learned: Control the Amount of Input
     We get different types of events
       event A (freq: 10,000) >>> event B (100) >> event C (1)
     Initially we put them all into a single log file:
       A A A A B A A B C
  59. Control Input…
     So we have to process the entire file, even if we are only interested in ‘event C’: too much wasted processing
     So we split the logs:
       log_A….gz
       log_B….gz
       log_C….gz
     Now we process only a fraction of our logs
       Input: s3://my_bucket/logs/log_B*
       x-ref using memcache if needed
  60. Lessons Learned: Incremental Log Processing
     Recent data (today / yesterday / this week) is more relevant than older data (6 months +)
     Adding a ‘time window’ to our stats
       Only process newer logs → faster
  61. EMR Trade-offs
     Lower performance on MR jobs compared to a dedicated cluster
       Reduced data throughput (S3 isn’t the same as local disk)
       Streaming data from S3, for each job
     EMR Hadoop is not the latest version
     Missing tools: Oozie
     Right now, trading performance for convenience and cost
  62. Next Steps: Faster Processing
     Streaming S3 data for each MR job is not optimal
     Instead:
       Spin up the cluster
       Copy data from S3 to HDFS
       Run all MR jobs (make use of data locality)
       Terminate
  63. Next Steps: More Processing
     More MR jobs
     More frequent data processing
       Frequent log rolls
       Smaller delta window
  64. Next Steps: New Software
     Python, mrjob (from Yelp)
     Scribe → Cloudera Flume?
     Use workflow tools like Oozie
     Hive?
       Ad-hoc SQL-like queries
  65. Next Steps: SPOT Instances
     SPOT instances: name your price (eBay style)
     Been available on EC2 for a while
     Just became available for Elastic Map Reduce!
     New cluster setup:
       10 normal instances + 10 spot instances
       Spot instances may go away at any time
       That is fine! Hadoop will handle node failures
     Bigger cluster: cheaper & faster
  66. Example Price Comparison
  67. Next Steps: NoSQL
     Summary data goes into mysql: a potential weak link (some tables have ~100 million rows and growing)
     Evaluating NoSQL solutions; using Membase in a limited capacity
     Watch out for Amazon’s HBase offering
  68. Take a Test Drive
     Just bring your credit card
     http://aws.amazon.com/elasticmapreduce/
     Forum: https://forums.aws.amazon.com/forum.jspa?forumID=52
  69. Thanks
     Questions?
     Sujee Maniyam
     http://sujee.net
     hello@sujee.net
     (Devil’s Slide, Pacifica)
