
3rd meetup - Intro to Amazon EMR


  1. ‘Amazon EMR’ coming up… by Sujee Maniyam
  2. Birmingham Big Data Group
     Amazon Elastic MapReduce for Startups
     Sujee Maniyam
     s@sujee.net / www.sujee.net
     Sept 14, 2011
  3. Hi, I’m Sujee
     10+ years of software development
     enterprise apps → web apps → iPhone apps → Hadoop
     More: http://sujee.net/tech
  4. I am an ‘expert’
  5. Ah… Data
  6. Nature of Data…
     Primary data
       Email, blogs, pictures, tweets
       Critical for operation (Gmail can’t lose emails)
     Secondary data
       Wikipedia access logs, Google search logs
       Not ‘critical’, but used to ‘enhance’ the user experience
       Search logs help predict ‘trends’
       Yelp can figure out you like Chinese food
  7. Data Explosion
     Primary data has grown phenomenally
     But secondary data has exploded in recent years
     “Log everything and ask questions later”
     Used for:
       Recommendations (books, restaurants, etc.)
       Predicting trends (job skills in demand)
       Showing ADS ($$$)
       etc.
     ‘Big Data’ is no longer just a problem for the big guys (Google / Facebook)
     Startups are struggling to get on top of ‘big data’
  8. Big Guys
  9. Startups
  10. Startups and Big Data
  11. Hadoop to the Rescue
      Hadoop can help with Big Data
      Hadoop has been proven in the field
      Under active development
      Throw hardware at the problem
        Getting cheaper by the year
      Bleeding-edge technology
        Hire good people!
  12. Hadoop: It is a CAREER
  13. Data Spectrum
  14. Who is Using Hadoop?
  15. About This Presentation
      Based on my experience with a startup
        5 people (3 engineers)
        Ad-serving space
        Amazon EC2 is our ‘data center’
      Technologies:
        Web stack: Python, Tornado, PHP, MySQL, LAMP
        Amazon EMR to crunch data
      Data size: 1 TB / week
  16. Story of a Startup
      We served targeted ads
      Tons of click data
      Stored it in MySQL
      Outgrew MySQL pretty quickly
  17. Data @ 6 Months
      2 TB of data already
      50-100 GB of new data / day
      And we were operating at 20% of our capacity!
  18. Future…
  19. Solution?
      Scalable database (NoSQL)
        HBase
        Cassandra
      Hadoop log processing / MapReduce
  20. What We Evaluated
      1) HBase cluster
      2) Hadoop cluster
      3) Amazon EMR
  21. Hadoop on Amazon EC2
      Two ways to run Hadoop on EC2:
      1) Permanent cluster
      2) On-demand cluster (Elastic MapReduce)
  22. 1) Permanent Hadoop Cluster
  23. Architecture 1
  24. Hadoop Cluster
      7 c1.xlarge machines
      15 TB of EBS volumes
      Sqoop exports MySQL log tables into HDFS
      Logs are compressed (gz) to minimize disk usage (data-locality trade-off)
      All is working well…
  25. 2 Months Later
      A couple of EBS volumes DIE
      A couple of EC2 instances DIE
      Maintaining the Hadoop cluster is a mechanical job → less appealing
      COST!
        Our job utilization is about 50%
        But we still pay for machines running 24x7
  26. Lessons Learned
      c1.xlarge is pretty stable (8 cores / 8 GB memory)
      EBS volumes:
        Max size is 1 TB, so string a few together for higher density per node
        DON’T RAID them; let Hadoop handle them as individual disks
        They might fail
        Back up data on S3
      Or skip EBS: use instance-store disks and keep data in S3
      Use Apache Whirr to set up the cluster easily
  27. Amazon Storage Options
  28. Amazon EC2 Cost
  29. Hadoop Cluster on EC2: Cost
      $3,500 = 7 c1.xlarge @ $500 / month
      $1,500 = 15 TB EBS storage @ $0.10 per GB
      $  500 = EBS I/O requests @ $0.10 per 1 million requests
      → $5,500 / month
      → $66,000 / year!
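For anyone who wants to replay the arithmetic, here is a minimal Python sketch of the monthly bill above; the unit prices are the ones quoted on the slide, not current AWS pricing.

```python
# Monthly / yearly cost of the always-on 7-node cluster, prices as quoted above.
INSTANCE_MONTHLY = 500.0   # one c1.xlarge, approx $/month
EBS_PER_GB_MONTH = 0.10    # EBS storage, $/GB-month
EBS_IO_MONTHLY   = 500.0   # observed EBS I/O request charges, $/month

instances = 7 * INSTANCE_MONTHLY           # $3,500
ebs_store = 15 * 1000 * EBS_PER_GB_MONTH   # 15 TB ~ 15,000 GB -> $1,500
monthly   = instances + ebs_store + EBS_IO_MONTHLY
print("monthly: $%.0f  yearly: $%.0f" % (monthly, 12 * monthly))  # $5500 / $66000
```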
  30. Buy / Rent?
      Typical Hadoop machine cost: $10-15k
        10-node cluster = $100k+
        Plus data-center costs
        Plus IT-ops costs
      Amazon EC2 10-node cluster:
        $500 x 10 = $5,000 / month = $60k / year
  31. Buy / Rent
      Amazon EC2 is great for:
        Quickly getting started
        Startups
        Scaling on demand / rapidly adding more servers
          Popular social games
        The Netflix story:
          Streaming is powered by EC2
          Encoding movies, etc.
          Uses 1000s of instances
      Not so economical for running clusters 24x7
      http://blog.rapleaf.com/dev/2008/12/10/rent-or-own-amazon-ec2-vs-colocation-comparison-for-hadoop-clusters/
  32. Buy vs Rent
  33. Amazon’s Elastic MapReduce
      Basically an ‘on demand’ Hadoop cluster:
        Store data on Amazon S3
        Kick off a Hadoop cluster to process the data
        Shut down when done
        Pay for the HOURS used
  34. Architecture 2: Amazon EMR
  35. Moving Parts
      Logs go into Scribe
      The Scribe master ships logs into S3, gzipped
      Spin up an EMR cluster, run the job, done
      Using the same old Java MR jobs on EMR
      Summary data is written directly to MySQL (no output files from the reducers; see the sketch below)
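The “summary straight to MySQL” step could look roughly like this. A minimal sketch, assuming the MySQLdb driver (MySQL-python, a natural fit for the LAMP stack above) and a made-up click_summary table with a unique key on (site, day); none of these names come from the talk.

```python
import MySQLdb  # assumption: the MySQL-python driver

def write_summary(counts):
    """Upsert aggregated click counts straight into MySQL,
    instead of leaving reducer output files behind.
    Host, table, and column names are hypothetical."""
    conn = MySQLdb.connect(host="stats-db", user="stats", passwd="secret", db="stats")
    cur = conn.cursor()
    for (site, day), clicks in counts.items():
        cur.execute(
            "INSERT INTO click_summary (site, day, clicks) "
            "VALUES (%s, %s, %s) "
            "ON DUPLICATE KEY UPDATE clicks = clicks + VALUES(clicks)",
            (site, day, clicks))
    conn.commit()
    conn.close()
```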
  36. EMR Wins
      Cost → only pay for use
      http://aws.amazon.com/elasticmapreduce/pricing/
      Example: EMR ran on 5 c1.xlarge for 3 hrs
        EC2 instances for 3 hrs = $0.68/hr x 5 instances x 3 hrs = $10.20
        http://aws.amazon.com/elasticmapreduce/faqs/#billing-4
        (1 hour of c1.xlarge = 8 hours of normalized compute time)
        EMR cost = 5 instances x 3 hrs x 8 normalized hrs x $0.12 = $14.40
      Plus S3 storage cost: 1 TB / month = $150
      Data bandwidth from S3 to EC2 is FREE!
      → about $25 per run
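The normalized-hours surcharge is the confusing part of the bill, so here is the same calculation spelled out in a few lines of Python, using the slide’s prices.

```python
# EMR job cost: 5 c1.xlarge instances for 3 hours, prices as quoted above.
EC2_RATE       = 0.68  # $ per c1.xlarge instance-hour
EMR_SURCHARGE  = 0.12  # $ per normalized compute hour
NORMALIZED_HRS = 8     # 1 c1.xlarge hour counts as 8 normalized hours

ec2 = 5 * 3 * EC2_RATE                        # $10.20
emr = 5 * 3 * NORMALIZED_HRS * EMR_SURCHARGE  # $14.40
print("per-run cost: $%.2f" % (ec2 + emr))    # ~ $25
```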
  37. Design Wins
      Bidders now write logs to Scribe directly
      No MySQL at the web-server machines
        Writes are much faster!
      S3 has been reliable and cheap storage
  38. EMR Wins
      No Hadoop cluster to maintain → no failed nodes / disks
  39. EMR Wins
      Hadoop clusters can be of any size!
      Spin up a cluster for each job:
        Smaller jobs → fewer machines
        Memory-hungry tasks → m1.xlarge
        CPU-hungry tasks → c1.xlarge
  40. EMR Trade-offs
      Lower performance on MR jobs compared to a dedicated cluster
        Reduced data throughput (S3 isn’t the same as local disk)
        Data is streamed from S3 for each job
      EMR’s Hadoop is not the latest version
      Missing tools: Oozie
      Right now, we’re trading performance for convenience and cost
  41. Lessons Learned
      Debugging a failed MR job is tricky
        Because the Hadoop cluster is terminated → no log files
        So: save the log files to S3
  42. Lessons: Script Everything
      Scripts (sketch below) to:
        Launch jar EMR jobs
          Custom parameters depending on the job’s needs (instance types, size of cluster, etc.)
        Monitor job progress
        Save logs for later inspection
        Track job status (finished / cancelled)
      https://github.com/sujee/amazon-emr-beyond-basics
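A minimal sketch of such a launch script using boto’s EMR support (the boto 2.x API of that era); the bucket names, jar path, and job arguments are placeholders, and the cluster size / instance type are exactly the per-job knobs mentioned above.

```python
from boto.emr.connection import EmrConnection
from boto.emr.step import JarStep

def launch(num_instances, instance_type, input_path, output_path):
    """Kick off a one-shot EMR job flow sized for this particular job."""
    conn = EmrConnection()  # picks up AWS keys from env / boto config
    step = JarStep(name="log-crunch",
                   jar="s3://my-bucket/jars/logcrunch.jar",  # placeholder jar
                   step_args=[input_path, output_path])
    return conn.run_jobflow(
        name="nightly log crunch",
        log_uri="s3://my-bucket/emr-logs/",  # logs survive cluster shutdown
        master_instance_type="m1.small",
        slave_instance_type=instance_type,   # m1.xlarge vs c1.xlarge per job
        num_instances=num_instances,
        steps=[step])

# Small job -> small cluster; poll describe_jobflow(jobflow_id).state to monitor.
jobflow_id = launch(5, "c1.xlarge",
                    "s3://my-bucket/logs/2011/09/", "s3://my-bucket/out/run1/")
```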
  43. Saved Logs
  44. MapReduce Tips: Log File Format
      CSV → JSON
      Started with CSV:
        CSV: "2","26","3","07807606-7637-41c0-9bc0-8d392ac73b42","MTY4Mjk2NDk0eDAuNDk4IDEyODQwMTkyMDB4LTM0MTk3OTg2Ng","2010-09-09 03:59:56:000 EDT","70.68.3.116","908105","http://housemdvideos.com/seasons/video.php?s=01&e=07","908105","160x600","performance","25","ca","housemdvideos.com","1","1.2840192E9","0","221","0.60000","NULL","NULL
        20-40 fields… fragile, position-dependent, hard to code
        url = csv[18] … counting position numbers gets old after the 100th time around
        if (csv.length == 29) url = csv[28] else url = csv[26]
  45. MapReduce Tips: Log File Format
      JSON: { exchange_id: 2, url: “http://housemdvideos.com/seasons/video.php?s=01&e=07” … }
      Self-describing, easy to add new fields, easy to process
        url = map.get(‘url’)
      Flatten the JSON to fit on ONE LINE
      Compresses pretty well (not much data inflation)
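In mapper code the difference looks roughly like this (the naive comma split mirrors the slide’s pseudocode; real CSV with quoted fields would need the csv module):

```python
import json

def url_from_csv(line):
    # Position-dependent: breaks whenever fields are added or reordered.
    # Naive split, as in the slide's pseudocode.
    fields = line.split(',')
    return fields[28] if len(fields) == 29 else fields[26]

def url_from_json(line):
    # Self-describing, one flattened record per line; field order irrelevant.
    record = json.loads(line)
    return record.get('url')
```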
  46. MapReduce Tips: Incremental Log Processing
      Recent data (today / yesterday / this week) is more relevant than older data (6+ months old)
  47. MapReduce Tips: Incremental Log Processing
      Adding a ‘time window’ to our stats
        Only process newer logs → faster
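One simple way to get that time window is to build the job’s input list from date-stamped S3 prefixes. A sketch, assuming a (hypothetical) s3://my-bucket/logs/YYYY/MM/DD/ layout:

```python
from datetime import date, timedelta

def input_paths(days_back):
    """S3 input prefixes covering only the last `days_back` days of logs.
    Assumes a date-partitioned layout: s3://my-bucket/logs/YYYY/MM/DD/"""
    today = date.today()
    days = (today - timedelta(days=i) for i in range(days_back))
    return ["s3://my-bucket/logs/%04d/%02d/%02d/" % (d.year, d.month, d.day)
            for d in days]

# A 'this week' stats job reads 7 prefixes instead of the whole history:
print(input_paths(7))
```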
  48. Next Steps
      New software:
        Pig, Python mrjob (from Yelp)
        Scribe → Cloudera Flume?
        Workflow tools like Oozie
        Hive?
          Ad-hoc SQL-like queries
  49. Next Steps: SPOT Instances
      Spot instances: name your price (eBay style)
        Available on EC2 for a while
        Just became available for Elastic MapReduce!
      New cluster setup:
        10 normal instances + 10 spot instances
        Spot instances may go away at any time
          That is fine! Hadoop will handle node failures
      Bigger cluster: cheaper & faster
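With boto, this mixed on-demand + spot layout can be expressed as instance groups; a sketch under that assumption (boto’s InstanceGroup API), with a made-up bid price.

```python
from boto.emr.connection import EmrConnection
from boto.emr.instance_group import InstanceGroup

groups = [
    # Master + core nodes stay on-demand, so HDFS and job state survive spot churn.
    InstanceGroup(1,  "MASTER", "m1.small",  "ON_DEMAND", "master"),
    InstanceGroup(10, "CORE",   "c1.xlarge", "ON_DEMAND", "core"),
    # Task-only spot nodes: if the spot price spikes and they vanish,
    # Hadoop simply re-runs their tasks elsewhere.
    InstanceGroup(10, "TASK",   "c1.xlarge", "SPOT", "spot-tasks", bidprice="0.30"),
]

conn = EmrConnection()
jobflow_id = conn.run_jobflow(name="crunch with spot",
                              log_uri="s3://my-bucket/emr-logs/",
                              instance_groups=groups,
                              steps=[])  # add JarSteps as in the launch script above
```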
  50. Example Price Comparison
  51. In Summary…
      Amazon EMR could be a great solution
      We are happy!
  52. Take a Test Drive
      Just bring your credit card
      http://aws.amazon.com/elasticmapreduce/
      Forum: https://forums.aws.amazon.com/forum.jspa?forumID=52
  53. Thanks
      Questions?
      Sujee Maniyam
      http://sujee.net
      hello@sujee.net
      (Photo: Devil’s Slide, Pacifica)
