Big Data Cloud Meetup
Cost Effective Big-Data Processing using Amazon Elastic MapReduce
Sujee Maniyam
s@sujee.net / www.sujee.net
July 08, 2011
Hi, I’m Sujee
- 10+ years of software development
- enterprise apps → web apps → iphone apps → Hadoop
- More: http://sujee.net/tech
I am an ‘expert’ :-)
Quiz (PRIZE!)
Where was this picture taken?
Quiz: Where was this picture taken?
Answer: Montara Lighthouse
Ah.. Data
Nature of Data…
- Primary data: email, blogs, pictures, tweets
  - Critical for operation (Gmail can’t lose emails)
- Secondary data: Wikipedia access logs, Google search logs
  - Not ‘critical’, but used to ‘enhance’ the user experience
  - Search logs help predict ‘trends’
  - Yelp can figure out you like Chinese food
Data Explosion
- Primary data has grown phenomenally, but secondary data has exploded in recent years
- “Log everything and ask questions later”
- Used for:
  - Recommendations (books, restaurants, etc.)
  - Predicting trends (job skills in demand)
  - Showing ads ($$$)
- ‘Big Data’ is no longer just a problem for the big guys (Google / Facebook)
- Startups are struggling to get on top of ‘big data’
Big Guys
Startups
Startups and big data
Hadoop to the Rescue
- Hadoop can help with Big Data
- Hadoop has been proven in the field
- Under active development
- Throw hardware at the problem: it is getting cheaper by the year
- Bleeding edge technology: hire good people!
Hadoop: It is a CAREER
Data Spectrum
Who is Using Hadoop?
About This Presentation
- Based on my experience with a startup: 5 people (3 engineers)
- Ad-serving space
- Amazon EC2 is our ‘data center’
- Technologies:
  - Web stack: Python, Tornado, PHP, MySQL (LAMP)
  - Amazon EMR to crunch data
- Data size: 1 TB / week
Story of a Startup… Month 1
- Each web server writes logs locally
- Logs were copied to a log server and purged from the web servers
- Log data size: ~100-200 GB
Story of a Startup… Month 6
- More web servers come online
- The aggregate log server falls behind
Data @ 6 Months
- 2 TB of data already
- 50-100 GB of new data / day
- And we were operating at 20% of our capacity!
Future…
Solution?
- Scalable database (NoSQL): HBase, Cassandra
- Hadoop log processing / MapReduce
What We Evaluated
1) HBase cluster
2) Hadoop cluster
3) Amazon EMR
Hadoop on Amazon EC2
1) Permanent cluster
2) On-demand cluster (Elastic MapReduce)
1) Permanent Hadoop Cluster
Architecture 1
Hadoop Cluster
- 7 c1.xlarge machines
- 15 TB of EBS volumes
- Sqoop exports MySQL log tables into HDFS
- Logs are compressed (gz) to minimize disk usage (a data locality trade-off)
- All is working well…
2 Months Later
- A couple of EBS volumes DIE
- A couple of EC2 instances DIE
- Maintaining the Hadoop cluster is a mechanical job → less appealing
- COST!
  - Our job utilization is about 50%
  - But we are still paying for machines running 24x7
Lessons Learned
- c1.xlarge is pretty stable (8 cores / 8 GB memory)
- EBS volumes:
  - Max size is 1 TB, so string a few together for higher density per node
  - DON’T RAID them; let Hadoop handle them as individual disks
  - They might fail, so back up data to S3
  - Or skip EBS: use instance-store disks and keep data in S3
- Use Apache Whirr to set up a cluster easily
Amazon Storage Options
Amazon EC2 Cost
Hadoop Cluster on EC2: Cost
- $3,500 = 7 c1.xlarge @ $500 / month
- $1,500 = 15 TB EBS storage @ $0.10 per GB
- $500 = EBS I/O requests @ $0.10 per 1 million I/O requests
- $5,500 / month = $66,000 / year!
Buy / Rent?
- Typical Hadoop machine cost: $10-15k
- 10-node cluster ≈ $100k, plus data center costs, plus IT-ops costs
- Amazon EC2 10-node cluster: $500 × 10 = $5,000 / month = $60k / year
Buy / Rent
- Amazon EC2 is great for:
  - Quickly getting started
  - Startups
  - Scaling on demand / rapidly adding more servers (e.g., popular social games)
- The Netflix story:
  - Streaming is powered by EC2
  - Encoding movies, etc.
  - Uses 1000s of instances
- Not so economical for running clusters 24x7
- http://blog.rapleaf.com/dev/2008/12/10/rent-or-own-amazon-ec2-vs-colocation-comparison-for-hadoop-clusters/
Buy vs Rent
Next: Amazon EMR
Where was this picture taken?
Answer: Pacifica Pier
Amazon’s Elastic MapReduce
- Basically an ‘on demand’ Hadoop cluster
- Store data on Amazon S3
- Kick off a Hadoop cluster to process the data; shut it down when done
- Pay only for the HOURS used
Architecture 2: Amazon EMR
Moving Parts
- Logs go into Scribe
- The Scribe master ships logs into S3, gzipped
- Spin up an EMR cluster, run the job, done
- Using the same old Java MR jobs on EMR
- Summary data gets written directly to MySQL (no output files from the reducers)
EMR Wins: Cost
- Only pay for use (http://aws.amazon.com/elasticmapreduce/pricing/)
- Example: EMR ran on 5 c1.xlarge for 3 hrs
  - EC2 instances for 3 hrs: $0.68/hr × 5 instances × 3 hrs = $10.20
  - EMR charge: 5 instances × 3 hrs × 8 normalized hrs × $0.12 = $14.40
    (per http://aws.amazon.com/elasticmapreduce/faqs/#billing-4, 1 hour of c1.xlarge = 8 hours of normalized compute time)
- Plus S3 storage cost: 1 TB / month = $150
- Data bandwidth from S3 to EC2 is FREE!
- Total for the run: about $25
Design Wins
- Bidders now write logs to Scribe directly; no MySQL on the web server machines
- Writes are much faster!
- S3 has been reliable and cheap storage
EMR Wins
- No Hadoop cluster to maintain: no failed nodes / disks
EMR Wins
- Hadoop clusters can be of any size!
- Can have multiple Hadoop clusters:
  - Smaller jobs → fewer machines
  - Memory-hungry tasks → m1.xlarge
  - CPU-hungry tasks → c1.xlarge
EMR Trade-offs
- Lower performance on MR jobs compared to a dedicated cluster:
  - Reduced data throughput (S3 isn’t the same as local disk)
  - Data is streamed from S3 for each job
- EMR Hadoop is not the latest version
- Missing tools: Oozie
- Right now, we are trading performance for convenience and cost
Lessons Learned
- Debugging a failed MR job is tricky, because the Hadoop cluster is terminated → no log files
- Save log files to S3
Lessons: Script Everything
- Scripts to launch jar EMR jobs, with custom parameters depending on job needs (instance types, size of cluster, etc.)
- Monitor job progress
- Save logs for later inspection
- Record job status (finished / cancelled)
- https://github.com/sujee/amazon-emr-beyond-basics
Sample Launch Script

#!/bin/bash
## run-sitestats4.sh

# config
MASTER_INSTANCE_TYPE="m1.large"
SLAVE_INSTANCE_TYPE="c1.xlarge"
INSTANCES=5
export JOBNAME="SiteStats4"
export TIMESTAMP=$(date +%Y%m%d-%H%M%S)
# end config

echo "==========================================="
echo $(date +%Y%m%d.%H%M%S) " > $0 : starting...."
export t1=$(date +%s)

# launch the job flow and capture its id
export JOBID=$(elastic-mapreduce --plain-output --create \
    --name "${JOBNAME}__${TIMESTAMP}" \
    --num-instances "$INSTANCES" \
    --master-instance-type "$MASTER_INSTANCE_TYPE" \
    --slave-instance-type "$SLAVE_INSTANCE_TYPE" \
    --jar s3://my_bucket/jars/adp.jar \
    --main-class com.adpredictive.hadoop.mr.SiteStats4 \
    --arg s3://my_bucket/jars/sitestats4-prod.config \
    --log-uri s3://my_bucket/emr-logs/ \
    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
    --args "--core-config-file,s3://my_bucket/jars/core-site.xml,--mapred-config-file,s3://my_bucket/jars/mapred-site.xml")

sh ./emr-wait-for-completion.sh
Lessons: Tweak the Cluster for Each Job

mapred-config-m1-xl.xml:

<configuration>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1024M</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx3000M</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>3</value>
  </property>
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
  </property>
</configuration>
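This is the file that the --mapred-config-file argument in the launch script above points at; the configure-hadoop bootstrap action applies it to the cluster’s mapred-site.xml before the job runs.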
Saved Logs
Sample Saved Log
MapReduce Tips: Control the Amount of Input
- We get different types of events:
  - event A (freq: 10,000) >>> event B (100) >> event C (1)
- Initially we put them all into a single log file: AAAABAABC…
Control Input…
- We have to process the entire file, even if we are only interested in ‘event C’ → too much wasted processing
- So we split the logs: log_A….gz, log_B….gz, log_C….gz
- Now we only process a fraction of our logs
  - Input: s3://my_bucket/logs/log_B*
  - x-ref using memcache if needed (see the sketch below)
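A minimal driver sketch of this pattern, using the old-style mapred API available on 2011-era EMR Hadoop; the bucket, output path, and job name here are hypothetical:

// Sketch: run a stats job over only the event-B logs by globbing the input.
// Paths and class names are hypothetical.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class EventBStats {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(EventBStats.class);
    conf.setJobName("event-b-stats");
    // The glob matches log_B....gz only, so event-A and event-C files are never read.
    FileInputFormat.setInputPaths(conf, new Path("s3://my_bucket/logs/log_B*"));
    FileOutputFormat.setOutputPath(conf, new Path("s3://my_bucket/output/event_b/"));
    // conf.setMapperClass(...); conf.setReducerClass(...); etc.
    JobClient.runJob(conf);
  }
}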
MapReduce Tips: Data Joining (x-ref)
- Data is split across log files, so we need to x-ref during the map phase
- We used to load the data into the mapper’s memory (the data was small, and lived in MySQL)
- Now we use Membase (memcached-compatible)
- Two MR jobs are chained:
  - The first processes logfile_type_A and populates Membase (very quick; takes minutes)
  - The second processes logfile_type_B and cross-references values from Membase (a sketch follows)
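A sketch of the second job’s mapper, assuming the spymemcached client (Membase speaks the memcached protocol); the host, key scheme, and record layout are hypothetical:

// Sketch: cross-reference each record against Membase during the map phase.
import java.io.IOException;
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class XrefMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private MemcachedClient membase;

  @Override
  public void configure(JobConf job) {
    try {
      // One client per mapper JVM; "membase-host" is a hypothetical address.
      membase = new MemcachedClient(new InetSocketAddress("membase-host", 11211));
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String eventId = value.toString().split("\t")[0];  // hypothetical record format
    Object joined = membase.get("eventA:" + eventId);  // x-ref lookup
    if (joined != null) {
      out.collect(new Text(eventId), new Text(joined.toString()));
    }
  }

  @Override
  public void close() {
    membase.shutdown();
  }
}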
X-ref
MapReduce Tips: Logfile Format
- CSV → JSON
- Started with CSV:
  "2","26","3","07807606-7637-41c0-9bc0-8d392ac73b42","MTY4Mjk2NDk0eDAuNDk4IDEyODQwMTkyMDB4LTM0MTk3OTg2Ng","2010-09-09 03:59:56:000 EDT","70.68.3.116","908105","http://housemdvideos.com/seasons/video.php?s=01&e=07","908105","160x600","performance","25","ca","housemdvideos.com","1","1.2840192E9","0","221","0.60000","NULL","NULL"
- 20-40 fields… fragile, position dependent, hard to code:
  url = csv[18]  (counting position numbers gets old after the 100th time around)
  if (csv.length == 29) url = csv[28]; else url = csv[26];
MapReduce Tips: Logfile Format
- JSON: { "exchange_id": 2, "url": "http://housemdvideos.com/seasons/video.php?s=01&e=07", … }
- Self-describing, easy to add new fields, easy to process: url = map.get("url")
- Flatten each JSON record to fit on ONE LINE
- Compresses pretty well (not much data inflation)
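A sketch of that lookup in Java, using the org.json parser (any JSON library would do; the field names follow the example above):

// Sketch: one JSON object per line; fields are accessed by name, not position.
import org.json.JSONObject;

public class JsonLogLine {
  public static void main(String[] args) throws Exception {
    String line = "{\"exchange_id\": 2, "
        + "\"url\": \"http://housemdvideos.com/seasons/video.php?s=01&e=07\"}";
    JSONObject record = new JSONObject(line);
    // Adding new fields to the logs never breaks this code.
    String url = record.getString("url");
    int exchangeId = record.getInt("exchange_id");
    System.out.println(exchangeId + " -> " + url);
  }
}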
MapReduce Tips: Incremental Log Processing
- Recent data (today / yesterday / this week) is more relevant than older data (6+ months)
MapReduce Tips: Incremental Log Processing
- Adding a ‘time window’ to our stats: only process newer logs → faster (see the sketch below)
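One way to implement the window, assuming (hypothetically) that logs land under dated S3 prefixes:

// Sketch: add only the last N days of logs as job input, skipping older data.
// Assumes a hypothetical s3://my_bucket/logs/YYYY-MM-DD/ layout.
import java.text.SimpleDateFormat;
import java.util.Calendar;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class TimeWindowInput {
  public static void addWindow(JobConf conf, int days) {
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
    Calendar day = Calendar.getInstance();
    for (int i = 0; i < days; i++) {
      FileInputFormat.addInputPath(conf,
          new Path("s3://my_bucket/logs/" + fmt.format(day.getTime()) + "/"));
      day.add(Calendar.DAY_OF_MONTH, -1);  // step back one day
    }
  }
}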
Next Steps
Where was this picture taken?
Answer: Foster City
Next Steps: Faster Processing
- Streaming S3 data for each MR job is not optimal
- Instead: spin up the cluster, copy data from S3 into HDFS once (e.g., with hadoop distcp), run all MR jobs there (making use of data locality), then terminate
Next Steps: More Processing
- More MR jobs
- More frequent data processing:
  - Frequent log rolls
  - Smaller delta window (1 hr / 15 mins)
Next Steps: New Software
- Pig, Python mrjob (from Yelp)
- Scribe → Cloudera Flume?
- Workflow tools like Oozie
- Hive? Ad-hoc SQL-like queries
Next Steps: SPOT Instances
- Spot instances: name your price (eBay style)
- Available on EC2 for a while; just became available for Elastic MapReduce!
- New cluster setup: 10 normal instances + 10 spot instances
- Spot instances may go away at any time; that is fine! Hadoop will handle the node failures
- Bigger cluster: cheaper & faster (see the sketch below)
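A sketch of such a mixed cluster using the AWS SDK for Java; the counts, instance types, and bid price are hypothetical (keeping spot nodes in the TASK group means HDFS never lives on machines that can vanish):

// Sketch: 10 on-demand core workers plus 10 spot task workers.
import com.amazonaws.services.elasticmapreduce.model.InstanceGroupConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;

public class MixedCluster {
  public static JobFlowInstancesConfig build() {
    return new JobFlowInstancesConfig().withInstanceGroups(
        new InstanceGroupConfig()
            .withInstanceRole("MASTER")
            .withInstanceType("m1.large")
            .withInstanceCount(1),
        new InstanceGroupConfig()          // normal (on-demand) workers
            .withInstanceRole("CORE")
            .withInstanceType("c1.xlarge")
            .withInstanceCount(10),
        new InstanceGroupConfig()          // spot workers; may disappear anytime
            .withInstanceRole("TASK")
            .withInstanceType("c1.xlarge")
            .withInstanceCount(10)
            .withMarket("SPOT")
            .withBidPrice("0.30"));        // hypothetical bid, in USD per hour
  }
}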
Example Price Comparison
In Summary…
- Amazon EMR could be a great solution
- We are happy!
Take a Test Drive
- Just bring your credit card: http://aws.amazon.com/elasticmapreduce/
- Forum: https://forums.aws.amazon.com/forum.jspa?forumID=52
Thanks! Questions?
Sujee Maniyam
http://sujee.net
hello@sujee.net
(Photo: Devil’s Slide, Pacifica)
