BigDataCloud meetup - July 8th - Cost effective big-data processing using Amazon EMR - Presentation by Sujee
Published in: Technology, Business
  1. ‘Amazon EMR’ coming up… by Sujee Maniyam
  2. Big Data Cloud Meetup
     - Cost Effective Big-Data Processing using Amazon Elastic MapReduce
     - Sujee Maniyam
     - hello@sujee.net | www.sujee.net
     - July 08, 2011
  3. Cost Effective Big-Data Processing using Amazon Elastic MapReduce
     - Sujee Maniyam
     - http://sujee.net
     - hello@sujee.net
  4. Quiz
     - PRIZE!
     - Where was this picture taken?
  5. Quiz: Where was this picture taken?
  6. Answer: Montara Light House
  7. Hi, I’m Sujee
     - 10+ years of software development
     - enterprise apps → web apps → iPhone apps → Hadoop
     - Hands-on experience with Hadoop / HBase / Amazon ‘cloud’
     - More: http://sujee.net/tech
  8. I am an ‘expert’
  9. Ah.. Data
  10. Nature of Data…
      - Primary data
        - Email, blogs, pictures, tweets
        - Critical for operation (Gmail can’t lose emails)
      - Secondary data
        - Wikipedia access logs, Google search logs
        - Not ‘critical’, but used to ‘enhance’ the user experience
        - Search logs help predict ‘trends’
        - Yelp can figure out that you like Chinese food
  11. Data Explosion
      - Primary data has grown phenomenally
      - But secondary data has exploded in recent years
      - “Log everything and ask questions later”
      - Used for
        - Recommendations (books, restaurants, etc.)
        - Predicting trends (job skills in demand)
        - Showing ADS ($$$)
        - etc.
      - ‘Big Data’ is no longer just a problem for the big guys (Google / Facebook)
      - Startups are struggling to get on top of ‘big data’
  12. Hadoop to the Rescue
      - Hadoop can help with big data
      - Hadoop has been proven in the field
      - Under active development
      - Throw hardware at the problem
        - Getting cheaper by the year
      - Bleeding-edge technology
        - Hire good people!
  13. Hadoop: It is a CAREER
  14. Data Spectrum
  15. Who is Using Hadoop?
  16. Big Guys
  17. Startups
  18. Startups and big data
  19. About This Presentation
      - Based on my experience with a startup
        - 5 people (3 engineers)
        - Ad-serving space
      - Amazon EC2 is our ‘data center’
      - Technologies:
        - Web stack: Python, Tornado, PHP, MySQL, LAMP
        - Amazon EMR to crunch data
      - Data size: 1 TB / week
  20. Story of a Startup… month 1
      - Each web server writes logs locally
      - Logs were copied to a log server and purged from the web servers
      - Log data size: ~100-200 GB
  21. Story of a Startup… month 6
      - More web servers come online
      - The aggregate log server falls behind
  22. Data @ 6 months
      - 2 TB of data already
      - 50-100 GB of new data / day
      - And we were operating at 20% of our capacity!
  23. Future…
  24. Solution?
      - Scalable database (NoSQL)
        - HBase
        - Cassandra
      - Hadoop log processing / MapReduce
  25. What We Evaluated
      - 1) HBase cluster
      - 2) Hadoop cluster
      - 3) Amazon EMR
  26. Hadoop on Amazon EC2
      - 1) Permanent cluster
      - 2) On-demand cluster (Elastic MapReduce)
  27. 1) Permanent Hadoop Cluster
  28. Architecture 1
  29. Hadoop Cluster
      - 7 c1.xlarge machines
      - 15 TB of EBS volumes
      - Sqoop exports MySQL log tables into HDFS
      - Logs are compressed (gz) to minimize disk usage (a data-locality trade-off)
      - All is working well…
  30. Lessons Learned
      - c1.xlarge is pretty stable (8 cores / 8 GB memory)
      - EBS volumes
        - Max size is 1 TB, so attach several per node for higher density
        - DON’T RAID them; let Hadoop handle them as individual disks
      - Open question: skip EBS? Use instance-store disks and keep the data in S3
  31. Amazon Storage Options
  32. 2 months later
      - A couple of EBS volumes DIE
      - A couple of EC2 instances DIE
      - Maintaining the Hadoop cluster is a mechanical job, and not an appealing one
      - COST!
        - Our job utilization is about 50%
        - But we are still paying for machines running 24x7
  33. Amazon EC2 Cost
  34. Hadoop cluster on EC2 cost
      - $3,500 = 7 c1.xlarge @ $500 / month
      - $1,500 = 15 TB EBS storage @ $0.10 per GB
      - $500 = EBS I/O requests @ $0.10 per 1 million I/O requests
      - Total: $5,500 / month
      - More than $60,000 / year!
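
The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the rates are the 2011 figures quoted on the slide, the constant names are made up for illustration, and 15 TB is counted as 15,000 GB.

```python
from decimal import Decimal

# Figures as quoted on the slide (2011 prices, not current AWS rates).
INSTANCES = 7
INSTANCE_COST_PER_MONTH = Decimal("500")   # one c1.xlarge, per month
EBS_GB = 15_000                            # 15 TB, assuming 1 TB = 1,000 GB
EBS_COST_PER_GB_MONTH = Decimal("0.10")
EBS_IO_COST_PER_MONTH = Decimal("500")     # the slide's estimate for I/O requests


def monthly_cluster_cost():
    """Sum the three line items: compute, EBS storage, EBS I/O."""
    compute = INSTANCES * INSTANCE_COST_PER_MONTH
    storage = EBS_GB * EBS_COST_PER_GB_MONTH
    return compute + storage + EBS_IO_COST_PER_MONTH


monthly = monthly_cluster_cost()
yearly = monthly * 12
print(f"monthly: ${monthly:,.0f}  yearly: ${yearly:,.0f}")
```

Twelve months at $5,500 comes to $66,000, regardless of how busy the cluster actually is.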
  35. Buy / Rent?
      - Typical Hadoop machine cost: $10k
      - 10-node cluster = $100k
        - Plus data center costs
        - Plus IT-ops costs
      - Amazon EC2 10-node cluster:
        - $500 * 10 = $5,000 / month = $60k / year
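
One way to read the buy / rent numbers is as a break-even point. The sketch below uses only the hardware price and the EC2 rate from the slide; it deliberately ignores the data center and IT-ops costs, which the slide does not quantify and which would push the break-even later. The function name is hypothetical.

```python
def breakeven_months(purchase_cost, rent_per_month):
    """Months of renting after which cumulative rent equals the purchase price."""
    return purchase_cost / rent_per_month


NODES = 10
buy = NODES * 10_000   # $10k per machine, from the slide
rent = NODES * 500     # $500 per instance per month, from the slide
months = breakeven_months(buy, rent)
print(months)
```

On hardware cost alone, a cluster rented 24x7 catches up with the purchase price in under two years.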
  36. Buy / Rent
      - Amazon EC2 is great for
        - Quickly getting started
          - Startups
        - Scaling on demand / rapidly adding more servers
          - Popular social games
      - Netflix story
        - Streaming is powered by EC2
        - Encoding movies, etc.
        - Uses 1000s of instances
      - Not so economical for running clusters 24x7
  37. Next: Amazon EMR
  38. Where was this picture taken?
  39. Answer: Pacifica Pier
  40. Amazon’s solution: Elastic MapReduce
      - Store data on Amazon S3
      - Kick off a Hadoop cluster to process the data
      - Shut it down when done
      - Pay for the HOURS used
  41. Architecture: Amazon EMR
  42. Moving parts
      - Logs go into Scribe
      - The Scribe master ships logs into S3, gzipped
      - Spin up an EMR cluster, run the job, done
      - Using the same old Java MR jobs for EMR
      - Summary data is written directly to MySQL
  43. EMR Launch Scripts
      - Scripts
        - to launch jar EMR jobs
        - Custom parameters depending on job needs (instance types, size of cluster, etc.)
        - Monitor job progress
        - Save logs for later inspection
        - Job status (finished / cancelled)
      - https://github.com/sujee/amazon-emr-beyond-basics
  44. Sample Launch Script

        #!/bin/bash
        ## run-sitestats4.sh

        # config
        MASTER_INSTANCE_TYPE="m1.large"
        SLAVE_INSTANCE_TYPE="c1.xlarge"
        INSTANCES=5
        export JOBNAME="SiteStats4"
        export TIMESTAMP=$(date +%Y%m%d-%H%M%S)
        # end config

        echo "==========================================="
        echo $(date +%Y%m%d.%H%M%S) " > $0 : starting...."

        # start time in epoch seconds, exported for child scripts
        export t1=$(date +%s)

        # create the job flow and capture its id (--plain-output prints only the id)
        export JOBID=$(elastic-mapreduce --plain-output --create \
            --name "${JOBNAME}__${TIMESTAMP}" \
            --num-instances "$INSTANCES" \
            --master-instance-type "$MASTER_INSTANCE_TYPE" \
            --slave-instance-type "$SLAVE_INSTANCE_TYPE" \
            --jar s3://my_bucket/jars/adp.jar \
            --main-class com.adpredictive.hadoop.mr.SiteStats4 \
            --arg s3://my_bucket/jars/sitestats4-prod.config \
            --log-uri s3://my_bucket/emr-logs/ \
            --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
            --args "--core-config-file,s3://my_bucket/jars/core-site.xml,--mapred-config-file,s3://my_bucket/jars/mapred-site.xml")

        # block until the job flow finishes
        sh ./emr-wait-for-completion.sh
  45. Mapred-config-m1-xl.xml

        <?xml version="1.0"?>
        <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
        <configuration>
          <property>
            <name>mapreduce.map.java.opts</name>
            <value>-Xmx1024M</value>
          </property>
          <property>
            <name>mapreduce.reduce.java.opts</name>
            <value>-Xmx3000M</value>
          </property>
          <property>
            <name>mapred.tasktracker.reduce.tasks.maximum</name>
            <value>3</value>
            <description>4 is running out of memory</description>
          </property>
          <property>
            <name>mapred.output.compress</name>
            <value>true</value>
          </property>
          <property>
            <name>mapred.output.compression.type</name>
            <value>BLOCK</value>
          </property>
        </configuration>
  46. emr-wait-for-completion.sh
      - Polls for job status periodically
      - Saves the logs
      - Calculates job run time
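
The polling idea behind a wait script like this can be sketched in Python. This is not the author's script (that one is a shell script in the GitHub repository linked on slide 43). The function name, the `get_state` callback, and the set of terminal state names are all assumptions made for illustration; a real version would query the EMR client for the job flow state and would also fetch the logs.

```python
import time

# Hypothetical terminal states; real names depend on the EMR client in use.
TERMINAL_STATES = {"COMPLETED", "FAILED", "TERMINATED"}


def wait_for_completion(get_state, poll_seconds=60, sleep=time.sleep, clock=time.time):
    """Poll get_state() until it reports a terminal state.

    Returns (final_state, elapsed_seconds).
    """
    start = clock()
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state, clock() - start
        sleep(poll_seconds)


# Usage with a fake status source standing in for the real job flow.
states = iter(["STARTING", "RUNNING", "RUNNING", "COMPLETED"])
final, elapsed = wait_for_completion(lambda: next(states), poll_seconds=0)
print(final)
```

Passing the status source, sleep, and clock in as parameters keeps the loop testable without a live cluster.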
  47. Saved Logs
  48. Sample Saved Log
  49. Data joining (x-ref)
      - Data is split across log files, so we need to cross-reference during the Map phase
      - We used to load the data into each mapper’s memory (the data was small and lived in MySQL)
      - Now we use Membase (Memcached)
      - Two MR jobs are chained
        - The first one processes logfile_type_A and populates Membase (very quick, takes minutes)
        - The second one processes logfile_type_B and cross-references values from Membase
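
The two-pass join on slide 49 can be illustrated without a cluster. In this sketch a plain Python dict stands in for Membase, and the record fields (`id`, `site`, `clicks`) and function names are made up; the actual jobs are Java MapReduce jobs whose record layout is not shown in the slides.

```python
def job_a_populate(type_a_records, store):
    """First pass: index type-A records by their join key."""
    for rec in type_a_records:
        store[rec["id"]] = rec["site"]


def job_b_map(type_b_records, store):
    """Second pass: enrich each type-B record with the value stored by the first pass."""
    for rec in type_b_records:
        site = store.get(rec["id"])   # key lookup instead of loading all data per mapper
        if site is not None:
            yield {**rec, "site": site}


store = {}   # stand-in for Membase
job_a_populate([{"id": "u1", "site": "example.com"}], store)
joined = list(job_b_map([{"id": "u1", "clicks": 3}, {"id": "u2", "clicks": 1}], store))
print(joined)
```

Records with no match in the store (here `u2`) are dropped, which is one possible policy; the slides do not say how unmatched records are handled.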
  50. X-ref
  51. EMR Wins
      - Cost: only pay for use
        - http://aws.amazon.com/elasticmapreduce/pricing/
      - Example: EMR ran on 5 c1.xlarge instances for 3 hrs
        - EC2 instances for 3 hrs = $0.68 per hr x 5 instances x 3 hrs = $10.20
        - http://aws.amazon.com/elasticmapreduce/faqs/#billing-4 (1 hour of c1.xlarge = 8 hours of normalized compute time)
        - EMR cost = 5 instances x 3 hrs x 8 normalized hrs x $0.12 EMR rate = $14.40
      - Plus S3 storage cost: 1 TB / month = $150
      - Data bandwidth from S3 to EC2 is FREE!
      - About $25 for the job run (EC2 + EMR)
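
The per-job figure on this slide can be reproduced as follows. The sketch simply repeats the slide's own formula with the slide's 2011 rates; the function and parameter names are hypothetical, and the rates should be checked against the current pricing page before reuse.

```python
from decimal import Decimal


def emr_job_cost(instances, hours, ec2_rate, normalized_hours, emr_rate):
    """Return the slide's two line items (EC2 charge, EMR charge) for one job run."""
    ec2 = instances * hours * ec2_rate
    emr = instances * hours * normalized_hours * emr_rate
    return ec2, emr


# 5 c1.xlarge instances for 3 hours, rates as quoted on the slide.
ec2, emr = emr_job_cost(5, 3, Decimal("0.68"), 8, Decimal("0.12"))
total = ec2 + emr
print(ec2, emr, total)
```

The two line items add up to $24.60, which the slide rounds to $25; S3 storage is billed separately by the month.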
  52. EMR Wins
      - No Hadoop cluster to maintain: no failed nodes / disks
      - Bonus: can tailor the cluster for various jobs
        - smaller jobs → fewer machines
        - memory-hungry tasks → m1.xlarge
        - CPU-hungry tasks → c1.xlarge
  53. Design Wins
      - Bidders now write logs to Scribe directly
        - No MySQL on the web server machines
        - Writes are much faster!
      - S3 has been reliable and cheap storage
  54. Next: Lessons Learned
  55. Where was this pic taken?
  56. Answer: Foster City
  57. Lessons learned: Logfile format
      - CSV → JSON
      - Started with CSV
      - CSV: "2","26","3","07807606-7637-41c0-9bc0-8d392ac73b42","MTY4Mjk2NDk0eDAuNDk4IDEyODQwMTkyMDB4LTM0MTk3OTg2Ng","2010-09-09 03:59:56:000 EDT","70.68.3.116","908105","http://housemdvideos.com/seasons/video.php?s=01&e=07","908105","160x600","performance","25","ca","housemdvideos.com","1","1.2840192E9","0","221","0.60000","NULL","NULL"
        - 20-40 fields… fragile, position dependent, hard to code
        - url = csv[18] … (counting position numbers gets old after the 100th time around)
        - if (csv.length == 29) url = csv[28]; else url = csv[26];
      - JSON: { "exchange_id": 2, "url": "http://housemdvideos.com/seasons/video.php?s=01&e=07", … }
        - Self-describing, easy to add new fields, easy to process
        - url = map.get("url")
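
The difference between position-based and name-based access can be shown in a few lines of Python. The two sample records below are shortened, made-up versions of the slide's log line (four fields instead of 20-40), and the variable names are for illustration only.

```python
import csv
import io
import json

# Shortened, made-up row in the style of the slide's CSV sample.
csv_line = '"2","26","housemdvideos.com","http://housemdvideos.com/seasons/video.php?s=01&e=07"'
# The same record as JSON (valid JSON needs quoted keys).
json_line = (
    '{"exchange_id": 2, "site": "housemdvideos.com", '
    '"url": "http://housemdvideos.com/seasons/video.php?s=01&e=07"}'
)

# CSV: the reader has to know that the URL is the 4th field.
fields = next(csv.reader(io.StringIO(csv_line)))
url_from_csv = fields[3]

# JSON: the field is looked up by name, so adding fields does not shift anything.
record = json.loads(json_line)
url_from_json = record.get("url")

print(url_from_csv == url_from_json)
```

If a new column is added in the middle of the CSV row, `fields[3]` silently returns the wrong value, while `record.get("url")` keeps working.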
  58. Lessons Learned: Control the amount of Input
      - We get different types of events
        - event A (freq: 10,000) >>> event B (100) >> event C (1)
      - Initially we put them all into a single log file
      - Sample file contents, one event per line: A, A, A, A, B, A, A, B, C
  59. Control Input…
      - So we had to process the entire file even when we were interested only in ‘event C’: too much wasted processing
      - So we split the logs
        - log_A….gz
        - log_B….gz
        - log_C…gz
      - Now we only process a fraction of our logs
      - Input: s3://my_bucket/logs/log_B*
      - x-ref using memcache if needed
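
A minimal sketch of the split-then-select idea from slides 58 and 59. It assumes each log line starts with its event type followed by a tab, which is an invented format; the slides do not show how events are tagged. The object names in the usage example are hypothetical as well.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def split_by_event(lines):
    """Group raw log lines by event type (first tab-separated field, by assumption)."""
    buckets = defaultdict(list)
    for line in lines:
        event_type = line.split("\t", 1)[0]
        buckets[f"log_{event_type}"].append(line)
    return buckets


def select_inputs(names, pattern):
    """Pick only the object names matching a glob such as 'log_B*'."""
    return sorted(n for n in names if fnmatchcase(n, pattern))


mixed = ["A\t1", "A\t2", "B\t3", "A\t4", "C\t5"]
buckets = split_by_event(mixed)
print({name: len(rows) for name, rows in buckets.items()})

inputs = select_inputs(["log_A.1.gz", "log_B.1.gz", "log_B.2.gz", "log_C.1.gz"], "log_B*")
print(inputs)
```

With the logs split this way, a job that only cares about event B never has to read the much larger event A files.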
  60. Lessons learned: Incremental Log Processing
      - Recent data (today / yesterday / this week) is more relevant than older data (6 months +)
      - Adding a ‘time window’ to our stats
      - Only process newer logs → faster
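
A time window like this can be applied when choosing job inputs. The sketch below assumes a hypothetical naming scheme, `log_B.YYYY-MM-DD.gz`, with the date embedded in the file name; the real log names are elided on the slides, so treat the parsing as illustrative only.

```python
from datetime import date, timedelta


def logs_in_window(names, today, days):
    """Keep log names whose embedded date falls within the last `days` days, today included."""
    cutoff = today - timedelta(days=days - 1)
    selected = []
    for name in names:
        # Hypothetical naming scheme: log_B.YYYY-MM-DD.gz
        stamp = name.split(".")[1]
        day = date.fromisoformat(stamp)
        if cutoff <= day <= today:
            selected.append(name)
    return selected


names = ["log_B.2011-01-05.gz", "log_B.2011-07-02.gz", "log_B.2011-07-08.gz"]
recent = logs_in_window(names, today=date(2011, 7, 8), days=7)
print(recent)
```

Only the files inside the window are handed to the job, so run time tracks the size of the window rather than the size of the archive.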
  61. EMR trade-offs
      - Lower performance on MR jobs compared to a cluster
        - Reduced data throughput (S3 isn’t the same as local disk)
        - Streaming data from S3 for each job
      - EMR Hadoop is not the latest version
      - Missing tools: Oozie
      - Right now, trading performance for convenience and cost
  62. Next steps: faster processing
      - Streaming S3 data for each MR job is not optimal
      - Spin up the cluster
      - Copy data from S3 to HDFS
      - Run all MR jobs (make use of data locality)
      - Terminate
  63. Next Steps: More Processing
      - More MR jobs
      - More frequent data processing
      - Frequent log rolls
      - Smaller delta window
  64. Next steps: new software
      - New software
        - Python, mrjob (from Yelp)
        - Scribe → Cloudera Flume?
        - Use workflow tools like Oozie
        - Hive?
          - Ad hoc SQL-like queries
  65. Next Steps: SPOT instances
      - SPOT instances: name your price (eBay style)
      - Been available on EC2 for a while
      - Just became available for Elastic MapReduce!
      - New cluster setup:
        - 10 normal instances + 10 spot instances
        - Spots may go away anytime
        - That is fine! Hadoop will handle node failures
      - Bigger cluster: cheaper & faster
  66. Example Price Comparison
  67. Next Steps: NoSQL
      - Summary data goes into MySQL: a potential weak link (some tables have ~100 million rows and growing)
      - Evaluating NoSQL solutions: using Membase in a limited capacity
      - Watch out for Amazon’s HBase offering
  68. Take a test drive
      - Just bring your credit card
      - http://aws.amazon.com/elasticmapreduce/
      - Forum: https://forums.aws.amazon.com/forum.jspa?forumID=52
  69. Thanks
      - Questions?
      - Sujee Maniyam
      - http://sujee.net
      - hello@sujee.net
      - Devil’s Slide, Pacifica