Cost effective BigData Processing on Amazon EC2

This is the talk I gave at Big Data Cloud Meetup on July 08, 2011.
http://www.meetup.com/BigDataCloud/events/19869611/


  1. Big Data Cloud Meetup
     Cost Effective Big-Data Processing using Amazon Elastic MapReduce
     Sujee Maniyam
     s@sujee.net / www.sujee.net
     July 08, 2011
  2. Hi, I'm Sujee
     10+ years of software development
     enterprise apps → web apps → iPhone apps → Hadoop
     More: http://sujee.net/tech
  3. I am an 'expert'
  4. Quiz
     PRIZE!
     Where was this picture taken?
  5. Quiz: Where was this picture taken?
  6. Answer: Montara Lighthouse
  7. Ah.. Data
  8. Nature of Data…
     Primary data
     - Email, blogs, pictures, tweets
     - Critical for operation (Gmail can't lose emails)
     Secondary data
     - Wikipedia access logs, Google search logs
     - Not 'critical', but used to 'enhance' the user experience
     - Search logs help predict 'trends'
     - Yelp can figure out you like Chinese food
  9. Data Explosion
     Primary data has grown phenomenally
     But secondary data has exploded in recent years
     "Log everything and ask questions later"
     Used for:
     - Recommendations (books, restaurants, etc.)
     - Predicting trends (job skills in demand)
     - Showing ads ($$$)
     - etc.
     'Big Data' is no longer just a problem for the big guys (Google / Facebook)
     Startups are struggling to get on top of 'big data'
  10. Big Guys
  11. Startups
  12. Startups and big data
  13. Hadoop to the Rescue
      Hadoop can help with Big Data
      Hadoop has been proven in the field
      Under active development
      Throw hardware at the problem
      - Getting cheaper by the year
      Bleeding-edge technology
      - Hire good people!
  14. Hadoop: It is a CAREER
  15. Data Spectrum
  16. Who is Using Hadoop?
  17. About This Presentation
      Based on my experience with a startup
      - 5 people (3 engineers)
      - Ad-serving space
      - Amazon EC2 is our 'data center'
      Technologies:
      - Web stack: Python, Tornado, PHP, MySQL, LAMP
      - Amazon EMR to crunch data
      Data size: 1 TB / week
  18. Story of a Startup… month 1
      Each web server writes logs locally
      Logs were copied to a log server and purged from the web servers
      Log data size: ~100-200 GB
  19. Story of a Startup… month 6
      More web servers come online
      The aggregate log server falls behind
  20. Data @ 6 Months
      2 TB of data already
      50-100 GB of new data / day
      And we were operating at 20% of our capacity!
  21. Future…
  22. Solution?
      Scalable database (NoSQL)
      - HBase
      - Cassandra
      Hadoop log processing / MapReduce
  23. What We Evaluated
      1) HBase cluster
      2) Hadoop cluster
      3) Amazon EMR
  24. Hadoop on Amazon EC2
      1) Permanent cluster
      2) On-demand cluster (Elastic MapReduce)
  25. 1) Permanent Hadoop Cluster
  26. Architecture 1
  27. Hadoop Cluster
      7 c1.xlarge machines
      15 TB of EBS volumes
      Sqoop exports MySQL log tables into HDFS
      Logs are compressed (gz) to minimize disk usage (data-locality trade-off)
      All is working well…
  28. 2 Months Later
      A couple of EBS volumes die
      A couple of EC2 instances die
      Maintaining the Hadoop cluster is a mechanical job → less appealing
      COST!
      - Our job utilization is about 50%
      - But we are still paying for machines running 24x7
  29. Lessons Learned
      c1.xlarge is pretty stable (8 cores / 8 GB memory)
      EBS volumes:
      - Max size is 1 TB, so string a few together for higher density per node
      - DON'T RAID them; let Hadoop handle them as individual disks
      - They might fail
      Back up data to S3
      Or skip EBS: use instance-store disks, and store data in S3
      Use Apache Whirr to set up a cluster easily
  30. Amazon Storage Options
  31. Amazon EC2 Cost
  32. Hadoop Cluster on EC2: Cost
      $3,500 = 7 c1.xlarge @ $500 / month
      $1,500 = 15 TB EBS storage @ $0.10 per GB
      $  500 = EBS I/O requests @ $0.10 per 1 million I/O requests
      = $5,500 / month
      = $66,000 / year!
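The monthly line items above are simple multiplication; a quick sketch using the slide's 2011 prices (illustrative only, current AWS pricing differs):

```python
# Permanent-cluster cost math from the slide (2011 prices, illustrative).
instances = 7 * 500                    # 7 c1.xlarge @ $500/month
ebs_storage = 15 * 1000 * 0.10         # 15 TB @ $0.10 per GB-month
ebs_io = 500                           # EBS I/O requests @ $0.10 per million

monthly = instances + ebs_storage + ebs_io   # $5,500 / month
yearly = monthly * 12                        # $66,000 / year
```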
  33. Buy / Rent?
      Typical Hadoop machine cost: $10-15k
      10-node cluster = ~$100k
      - Plus data-center costs
      - Plus IT-ops costs
      Amazon EC2 10-node cluster:
      $500 * 10 = $5,000 / month = $60k / year
  34. Buy / Rent
      Amazon EC2 is great for:
      - Quickly getting started
      - Startups
      - Scaling on demand / rapidly adding more servers (popular social games)
      The Netflix story:
      - Streaming is powered by EC2
      - Encoding movies, etc.
      - Uses 1000s of instances
      Not so economical for running clusters 24x7
      http://blog.rapleaf.com/dev/2008/12/10/rent-or-own-amazon-ec2-vs-colocation-comparison-for-hadoop-clusters/
  35. Buy vs Rent
  36. Next: Amazon EMR
  37. Where was this picture taken?
  38. Answer: Pacifica Pier
  39. Amazon's Elastic MapReduce
      Basically an 'on demand' Hadoop cluster:
      - Store data on Amazon S3
      - Kick off a Hadoop cluster to process the data
      - Shut down when done
      - Pay only for the HOURS used
  40. Architecture 2: Amazon EMR
  41. Moving Parts
      Logs go into Scribe
      The Scribe master ships logs into S3, gzipped
      Spin up an EMR cluster, run the job, done
      Reusing the same old Java MR jobs on EMR
      Summary data is written directly to MySQL (no output files from reducers)
  42. EMR Wins
      Cost → only pay for use
      http://aws.amazon.com/elasticmapreduce/pricing/
      Example: EMR ran on 5 c1.xlarge for 3 hrs
      - EC2 instances for 3 hrs = $0.68/hr x 5 instances x 3 hrs = $10.20
      - http://aws.amazon.com/elasticmapreduce/faqs/#billing-4
      - (1 hour of c1.xlarge = 8 hours of normalized compute time)
      - EMR cost = 5 instances x 3 hrs x 8 normalized hrs x $0.12 = $14.40
      Plus S3 storage cost: 1 TB / month = $150
      Data bandwidth from S3 to EC2 is FREE!
      → ~$25 per run
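The ~$25-per-run figure is just the sum of the two per-run line items; sketched with the slide's 2011 rates:

```python
# Per-run EMR cost example from the slide (2011 prices, illustrative).
ec2 = 0.68 * 5 * 3        # $0.68/hr x 5 c1.xlarge x 3 hrs  -> ~$10.20
emr = 5 * 3 * 8 * 0.12    # EMR fee on normalized instance-hours -> ~$14.40
run_cost = ec2 + emr      # ~$24.60 per run (S3 storage billed monthly, separately)
```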
  43. Design Wins
      Bidders now write logs to Scribe directly
      No MySQL on the web server machines
      Writes are much faster!
      S3 has been reliable, and cheap, storage
  44. EMR Wins
      No Hadoop cluster to maintain → no failed nodes / disks
  45. EMR Wins
      Hadoop clusters can be of any size!
      Can have multiple Hadoop clusters:
      - smaller jobs → fewer machines
      - memory-hungry tasks → m1.xlarge
      - cpu-hungry tasks → c1.xlarge
  46. EMR Trade-offs
      Lower performance on MR jobs compared to a dedicated cluster
      - Reduced data throughput (S3 isn't the same as local disk)
      - Data is streamed from S3 for each job
      EMR Hadoop is not the latest version
      Missing tools: Oozie
      Right now, trading performance for convenience and cost
  47. Lessons Learned
      Debugging a failed MR job is tricky
      - Because the Hadoop cluster is terminated → no log files
      - Save log files to S3
  48. Lessons: Script Everything
      Scripts:
      - to launch jar EMR jobs
        with custom parameters depending on job needs (instance types, size of cluster, etc.)
      - to monitor job progress
      - to save logs for later inspection
      - to record job status (finished / cancelled)
      https://github.com/sujee/amazon-emr-beyond-basics
  49. Sample Launch Script
      #!/bin/bash
      ## run-sitestats4.sh

      # config
      MASTER_INSTANCE_TYPE="m1.large"
      SLAVE_INSTANCE_TYPE="c1.xlarge"
      INSTANCES=5
      export JOBNAME="SiteStats4"
      export TIMESTAMP=$(date +%Y%m%d-%H%M%S)
      # end config

      echo "==========================================="
      echo $(date +%Y%m%d.%H%M%S) " > $0 : starting...."
      export t1=$(date +%s)

      export JOBID=$(elastic-mapreduce --plain-output --create \
          --name "${JOBNAME}__${TIMESTAMP}" \
          --num-instances "$INSTANCES" \
          --master-instance-type "$MASTER_INSTANCE_TYPE" \
          --slave-instance-type "$SLAVE_INSTANCE_TYPE" \
          --jar s3://my_bucket/jars/adp.jar \
          --main-class com.adpredictive.hadoop.mr.SiteStats4 \
          --arg s3://my_bucket/jars/sitestats4-prod.config \
          --log-uri s3://my_bucket/emr-logs/ \
          --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
          --args "--core-config-file,s3://my_bucket/jars/core-site.xml,--mapred-config-file,s3://my_bucket/jars/mapred-site.xml")

      sh ./emr-wait-for-completion.sh
  50. Lessons: Tweak the Cluster for Each Job
      mapred-config-m1-xl.xml:

      <configuration>
        <property>
          <name>mapreduce.map.java.opts</name>
          <value>-Xmx1024M</value>
        </property>
        <property>
          <name>mapreduce.reduce.java.opts</name>
          <value>-Xmx3000M</value>
        </property>
        <property>
          <name>mapred.tasktracker.reduce.tasks.maximum</name>
          <value>3</value>
        </property>
        <property>
          <name>mapred.output.compress</name>
          <value>true</value>
        </property>
        <property>
          <name>mapred.output.compression.type</name>
          <value>BLOCK</value>
        </property>
      </configuration>
  51. Saved Logs
  52. Sample Saved Log
  53. MapReduce Tips: Control the Amount of Input
      We get different types of events:
      event A (freq: 10,000) >>> event B (100) >> event C (1)
      Initially we put them all into a single log file:
      A A A A B A A B C
  54. Control Input…
      We had to process the entire file even if we were interested only in 'event C' → too much wasted processing
      So we split the logs:
      - log_A….gz
      - log_B….gz
      - log_C….gz
      Now we process only a fraction of our logs
      Input: s3://my_bucket/logs/log_B*
      Cross-reference using memcache if needed
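The split itself can be as simple as routing each event to a per-type file at log-writing time; a minimal sketch, assuming hypothetical JSON event records with an `event` field (the real pipeline used Scribe):

```python
import gzip
import json
import os

def split_events(events, out_dir):
    """Route a mixed stream of event dicts into per-type gzipped log files,
    so an MR job can select only the type it needs via an input glob
    (e.g. s3://my_bucket/logs/log_C* instead of the whole combined log)."""
    handles = {}
    try:
        for e in events:
            etype = e["event"]  # 'A', 'B', or 'C'
            if etype not in handles:
                path = os.path.join(out_dir, "log_%s.gz" % etype)
                handles[etype] = gzip.open(path, "wt")
            handles[etype].write(json.dumps(e) + "\n")
    finally:
        for h in handles.values():
            h.close()
```

A job that only cares about the rare event C now reads one tiny file instead of scanning thousands of A records.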
  55. MapReduce Tips: Data Joining (x-ref)
      Data is split across log files; we need to x-ref during the Map phase
      We used to load the data into the mapper's memory (the data was small, and in MySQL)
      Now we use Membase (memcached-compatible)
      Two MR jobs are chained:
      - The first processes logfile_type_A and populates Membase (very quick, takes minutes)
      - The second processes logfile_type_B, cross-referencing values from Membase
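A minimal sketch of that two-phase chain, with a plain dict standing in for Membase and a hypothetical `id,value` line format (in the real setup each mapper would talk to the Membase cluster over the network):

```python
def phase1_build_xref(type_a_lines):
    """Job 1: scan type-A records and populate the lookup store."""
    xref = {}
    for line in type_a_lines:
        key, value = line.split(",", 1)  # hypothetical "id,value" format
        xref[key] = value
    return xref

def phase2_join(type_b_lines, xref):
    """Job 2: enrich each type-B record with the value looked up from the store."""
    for line in type_b_lines:
        key, payload = line.split(",", 1)
        yield (key, xref.get(key, "UNKNOWN"), payload)

# usage: chain the two phases, as the two MR jobs are chained
xref = phase1_build_xref(["42,housemdvideos.com"])
rows = list(phase2_join(["42,160x600"], xref))
```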
  56. X-ref
  57. MapReduce Tips: Logfile Format
      CSV → JSON
      Started with CSV
      CSV: "2","26","3","07807606-7637-41c0-9bc0-8d392ac73b42","MTY4Mjk2NDk0eDAuNDk4IDEyODQwMTkyMDB4LTM0MTk3OTg2Ng","2010-09-09 03:59:56:000 EDT","70.68.3.116","908105","http://housemdvideos.com/seasons/video.php?s=01&e=07","908105","160x600","performance","25","ca","housemdvideos.com","1","1.2840192E9","0","221","0.60000","NULL","NULL
      20-40 fields… fragile, position-dependent, hard to code
      url = csv[18] … counting position numbers gets old after the 100th time around
      if (csv.length == 29) url = csv[28] else url = csv[26]
  58. MapReduce Tips: Logfile Format
      JSON: { "exchange_id": 2, "url": "http://housemdvideos.com/seasons/video.php?s=01&e=07", … }
      Self-describing: easy to add new fields, easy to process
      url = map.get('url')
      Flatten JSON to fit on ONE LINE
      Compresses pretty well (not much data inflation)
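With one JSON object per line, each record parses in one call and fields are accessed by name rather than by fragile position; a small sketch using the slide's example fields:

```python
import json

# One flattened JSON object per log line (field names from the slide's example).
line = '{"exchange_id": 2, "url": "http://housemdvideos.com/seasons/video.php?s=01&e=07"}'

record = json.loads(line)
url = record.get("url")           # by name, not by CSV position number
missing = record.get("referrer")  # a field added later is simply absent (None) in old logs
```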
  59. MapReduce Tips: Incremental Log Processing
      Recent data (today / yesterday / this week) is more relevant than older data (6+ months)
  60. MapReduce Tips: Incremental Log Processing
      Adding a 'time window' to our stats
      → only process newer logs → faster
  61. Next Steps
  62. Where was this pic taken?
  63. Answer: Foster City
  64. Next Steps: Faster Processing
      Streaming S3 data for each MR job is not optimal
      Instead:
      - Spin up the cluster
      - Copy data from S3 to HDFS
      - Run all MR jobs (make use of data locality)
      - Terminate
  65. Next Steps: More Processing
      More MR jobs
      More frequent data processing:
      - more frequent log rolls
      - smaller delta window (1 hr / 15 min)
  66. Next Steps: New Software
      Pig, Python mrjob (from Yelp)
      Scribe → Cloudera Flume?
      Workflow tools like Oozie
      Hive?
      - Ad hoc SQL-like queries
  67. Next Steps: Spot Instances
      Spot instances: name your price (eBay style)
      Available on EC2 for a while
      Just became available for Elastic MapReduce!
      New cluster setup:
      - 10 on-demand instances + 10 spot instances
      - Spots may go away at any time
      - That is fine! Hadoop will handle node failures
      Bigger cluster: cheaper & faster
  68. Example Price Comparison
  69. In Summary…
      Amazon EMR could be a great solution
      We are happy!
  70. Take a Test Drive
      Just bring your credit card
      http://aws.amazon.com/elasticmapreduce/
      Forum: https://forums.aws.amazon.com/forum.jspa?forumID=52
  71. Thanks
      Questions?
      Sujee Maniyam
      http://sujee.net
      hello@sujee.net
      Devil's Slide, Pacifica
