• Save
BigDataCloud meetup - July 8th - Cost effective big-data processing using Amazon EMR- Presentation by Sujee

BigDataCloud meetup - July 8th - Cost effective big-data processing using Amazon EMR- Presentation by Sujee






Total Views
Views on SlideShare
Embed Views



11 Embeds 2,550

http://www.techgig.com 2278
http://www.bigdatacloud.com 105
http://sozialpapier.com 55
http://www.sozialpapier.com 43 39
http://paper.li 11 9 4
http://www.linkedin.com 3
url_unknown 2 1



Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
Post Comment
Edit your comment

BigDataCloud meetup - July 8th - Cost effective big-data processing using Amazon EMR- Presentation by Sujee BigDataCloud meetup - July 8th - Cost effective big-data processing using Amazon EMR- Presentation by Sujee Presentation Transcript

  • ‘Amazon EMR’ coming up…by Sujee Maniyam
  • Big Data Cloud Meetup
    Cost Effective Big-Data Processing using Amazon Elastic Map Reduce
    Sujee Maniyam
    hello@sujee.net | www.sujee.net
    July 08, 2011
  • Cost Effective Big-Data Processing using Amazon Elastic Map Reduce
    Sujee Maniyam
  • Quiz
    Where was this picture taken?
  • Quiz : Where was this picture taken?
  • Answer : Montara Light House
  • Hi, I’m Sujee
    10+ years of software development
    enterprise apps  web apps iphone apps  Hadoop
    Hands on experience with Hadoop / Hbase/ Amazon ‘cloud’
    More : http://sujee.net/tech
  • I am an ‘expert’ 
  • Ah.. Data
  • Nature of Data…
    Primary Data
    Email, blogs, pictures, tweets
    Critical for operation (Gmail can’t loose emails)
    Secondary data
    Wikipedia access logs, Google search logs
    Not ‘critical’, but used to ‘enhance’ user experience
    Search logs help predict ‘trends’
    Yelp can figure out you like Chinese food
  • Data Explosion
    Primary data has grown phenomenally
    But secondary data has exploded in recent years
    “log every thing and ask questions later”
    Used for
    Recommendations (books, restaurants ..etc)
    Predict trends (job skills in demand)
    Show ADS ($$$)
    ‘Big Data’ is no longer just a problem for BigGuys (Google / Facebook)
    Startups are struggling to get on top of ‘big data’
  • Hadoop to Rescue
    Hadoop can help with BigData
    Hadoop has been proven in the field
    Under active development
    Throw hardware at the problem
    Getting cheaper by the year
    Bleeding edge technology
    Hire good people!
  • Hadoop: It is a CAREER
  • Data Spectrum
  • Who is Using Hadoop?
  • Big Guys
  • Startups
  • Startups and bigdata
  • About This Presentation
    Based on my experience with a startup
    5 people (3 Engineers)
    Ad-Serving Space
    Amazon EC2 is our ‘data center’
    Web stack : Python, Tornado, PHP, mysql , LAMP
    Amazon EMR to crunch data
    Data size : 1 TB / week
  • Story of a Startup…month-1
    Each web serverwrites logs locally
    Logs were copiedto a log-serverand purged from web servers
    Log Data size : ~100-200 G
  • Story of a Startup…month-6
    More web servers comeonline
    Aggregate log serverfalls behind
  • Data @ 6 months
    2 TB of data already
    50-100 G new data / day
    And we were operating on 20% of our capacity!
  • Future…
  • Solution?
    Scalable database (NOSQL)
    Hadoop log processing / Map Reduce
  • What We Evaluated
    1) Hbase cluster
    2) Hadoop cluster
    3) Amazon EMR
  • Hadoop on Amazon EC2
    1) Permanent Cluster
    2) On demand cluster (elastic map reduce)
  • 1) Permanent Hadoop Cluster
  • Architecture 1
  • Hadoop Cluster
    7 C1.xlarge machines
    15 TB EBS volumes
    Sqoop exports mysql log tables into HDFS
    Logs are compressed (gz) to minimize disk usage (data locality trade-off)
    All is working well…
  • Lessons Learned
    C1.xlarge is pretty stable (8 core / 8G memory)
    EBS volumes
    max size 1TB, so string few for higher density / node
    DON’T RAID them; let hadoop handle them as individual disks
    ?? : Skip EBS. Use instance store disks, and store data in S3
  • Amazon Storage Options
  • 2 months later
    Couple of EBS volumes DIE
    Couple of EC2 instances DIE
    Maintaining the hadoop cluster is mechanical job less appealing
    Our jobs utilization is about 50%
    But still paying for machines running 24x7
  • Amazon EC2 Cost
  • Hadoop cluster on EC2 cost
    $3,500 = 7 c1.xlarge @ $500 / month
    $1,500 = 15 TB EBS storage @ $0.10 per GB
    $ 500 = EBS I/O requests @ $0.10 per 1 million I/O requests
     $5,500 / month
    $60,000 / year !
  • Buy / Rent ?
    Typical hadoop machine cost : $10k
    10 node cluster = $100k
    Plus data center costs
    Plus IT-ops costs
    Amazon Ec2 10 node cluster:
    $500 * 10 = $5,000 / month = $60k / year
  • Buy / Rent
    Amazon EC2 is great, for
    Quickly getting started
    Scaling on demand / rapidly adding more servers
    popular social games
    Netflix story
    Streaming is powered by EC2
    Encoding movies ..etc
    Use 1000s of instances
    Not so economical for running clusters 24x7
  • Next : Amazon EMR
  • Where was this picture taken?
  • Answer : Pacifica Pier
  • Amazon’s solution : Elastic Map Reduce
    Store data on Amazon S3
    Kick off a hadoop cluster to process data
    Shutdown when done
    Pay for the HOURS used
  • Architecture : Amazon EMR
  • Moving parts
    Logs go into Scribe
    Scribe master ships logs into S3, gzipped
    Spin EMR cluster, run job, done
    Using same old Java MR jobs for EMR
    Summary data gets directly updated to a mysql
  • EMR Launch Scripts
    to launch jar EMR jobs
    Custom parameters depending on job needs (instance types, size of cluster ..etc)
    monitor job progress
    Save logs for later inspection
    Job status (finished / cancelled)
  • Sample Launch Script
    ## run-sitestats4.sh
    # config
    export JOBNAME="SiteStats4"
    export TIMESTAMP=$(date +%Y%m%d-%H%M%S)
    # end config
    echo "==========================================="
    echo $(date +%Y%m%d.%H%M%S) " > $0 : starting...."
    export t1=$(date +%s)
    export JOBID=$(elastic-mapreduce --plain-output --create --name "${JOBNAME}__${TIMESTAMP}" --num-instances "$INSTANCES" --master-instance-type "$MASTER_INSTANCE_TYPE" --slave-instance-type "$SLAVE_INSTANCE_TYPE" --jar s3://my_bucket/jars/adp.jar --main-class com.adpredictive.hadoop.mr.SiteStats4 --arg s3://my_bucket/jars/sitestats4-prod.config --log-uri s3://my_bucket/emr-logs/ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "--core-config-file,s3://my_bucket/jars/core-site.xml,--mapred-config-file,s3://my_bucket/jars/mapred-site.xml”)
    sh ./emr-wait-for-completion.sh
  • Mapred-config-m1-xl.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <decription>4 is running out of memory</description>
  • emr-wait-for-completion.sh
    Polls for job status periodically
    Saves the logs
    Calculates job run time
  • Saved Logs
  • Sample Saved Log
  • Data joining (x-ref)
    Data is split across log files, need to x-ref during Map phase
    Used to load the data in mapper’s memory (data was small and in mysql)
    Now we use Membase (Memcached)
    Two MR jobs are chained
    First one processes logfile_type_A and populates Membase (very quick, takes minutes)
    Second one, processes logfile_type_B, cross-references values from Membase
  • X-ref
  • EMR Wins
    Cost  only pay for use
    Example: EMR ran on 5 C1.xlarge for 3hrs
    EC2 instances for 3 hrs = $0.68 per hr x 5 inst x 3 hrs = $10.20
    (1 hour of c1.xlarge = 8 hours normalized compute time)
    EMR cost = 5 instances x 3 hrs x 8 normalized hrs x 0.12 emr = $14.40
    Plus S3 storage cost : 1TB / month = $150
    Data bandwidth from S3 to EC2 is FREE!
     $25 bucks
  • EMR Wins
    No hadoop cluster to maintainno failed nodes / disks
    Bonus : Can tailor cluster for various jobs
    smaller jobs  fewer number of machines
    memory hungry tasks  m1.xlarge
    cpu hungry tasks  c1.xlarge
  • Design Wins
    Bidders now write logs to Scribe directly
    No mysql at web server machines
    Writes much faster!
    S3 has been a reliable storage and cheap
  • Next : Lessons Learned
  • Where was this pic taken?
  • Answer : Foster City
  • Lessons learned : Logfile format
    CSV  JSON
    Started with CSV
    CSV: "2","26","3","07807606-7637-41c0-9bc0-8d392ac73b42","MTY4Mjk2NDk0eDAuNDk4IDEyODQwMTkyMDB4LTM0MTk3OTg2Ng","2010-09-09 03:59:56:000 EDT","","908105","http://housemdvideos.com/seasons/video.php?s=01&e=07","908105","160x600","performance","25","ca","housemdvideos.com","1","1.2840192E9","0","221","0.60000","NULL","NULL
    20-40 fields… fragile, position dependant, hard to code
    url = csv[18]…counting position numbers gets old after 100th time around)
    If (csv.length == 29) url = csv[28] else url = csv[26]
    JSON: { exchange_id: 2, url : “http://housemdvideos.com/seasons/video.php?s=01&e=07”….}
    Self-describing, easy to add new fields, easy to process
    url = map.get(‘url’)
  • Lessons Learned : Control the amount of Input
    We get different type of events
    event A (freq: 10,000) >>> event B (100) >> event C (1)
    Initially we put them all into a single log file
  • Control Input…
    So have to process the entire file, even if we are interested only in ‘event C’ too much wasted processing
    So we split the logs
    Now only processing fraction of our logs
    Input : s3://my_bucket/logs/log_B*
    x-ref using memcache if needed
  • Lessons learned : Incremental Log Processing
    Recent data (today / yesterday / this week) is more relevant than older data (6 months +)
    Adding ‘time window’ to our stats
    only process newer logs faster
  • EMR trade-offs
    Lower performance on MR jobs compared to a clusterReduced data throughput (S3 isn’t the same as local disk)
    Streaming data from S3, for each job
    EMR Hadoop is not the latest version
    Missing tools : Oozie
    Right now, trading performance for convenience and cost
  • Next steps : faster processing
    Streaming S3 data for each MR job is not optimal
    Spin cluster
    Copy data from S3 to HDFS
    Run all MR jobs (make use of data locality)
  • Next Steps : More Processing
    More MR jobs
    More frequent data processing
    Frequent log rolls
    Smaller delta window
  • Next steps : new software
    New Software
    Python, mrJOB(from Yelp)
    Scribe  Cloudera flume?
    Use work flow tools like Oozie
    Adhoc SQL like queries
  • Next Steps : SPOT instances
    SPOT instances : name your price (ebay style)
    Been available on EC2 for a while
    Just became available for Elastic map reduce!
    New cluster setup:
    10 normal instances + 10 spot instances
    Spots may go away anytime
    That is fine! Hadoop will handle node failures
    Bigger cluster : cheaper & faster
  • Example Price Comparison
  • Next Steps : nosql
    Summary data goes into mysqlpotential weak-link ( some tables have ~100 million rows and growing)
    Evaluating nosql solutionsusing Membase in limited capacity
    Watch out for Amazon’s Hbase offering
  • Take a test drive
    Just bring your credit-card 
    Forum : https://forums.aws.amazon.com/forum.jspa?forumID=52
  • Thanks
    Sujee Maniyam
    Devil’s slide, Pacifica