Harish Ganesan
CTO
8KMiles
2013
P1) This Presentation is
P2) Strongly Inspired by “Guy Ritchie”
Movies
P3) Disclaimer: All images are downloaded from the
internet. If you find any of the content / images violating
copyright, please let me know and I will act upon it
immediately.
AGENDA
• Case
• Challenge
• Solution
• Learnings
• About us
Case
Cigarette smoking is injurious to health
• Mobile Advertising company, USA
• Forbes 1000 clientele
• TBs of unstructured data -> Big Data Problem
Lock
• Hourly ~1 TB
• CDN Logs
• Text Files
• XML Files
• Geo data files
• Server logs
• DB records
STOCK
• Reduce the cost leakage
• How to Save $$$ ?
Challenges
• Daily (was OK), monthly (pain) and historical analysis (almost dead)
• How do we Transfer, Store, Analyze and Share ?
• How to optimize costs at this scale ?
Solution
Cigarette smoking is injurious to health
• Use AWS Cloud for hosting Analytics module
• Amazon EMR for unstructured Log Analysis
• Automation using Scripts, Java code and other
tools
[Diagram: Social / 3rd Party Feeds, Cloud, Logs]
Stage 1: Data Transfer
• Tsunami UDP
• ~1 TB uncompressed logs every hour
• High-bandwidth EC2 instances for Tsunami UDP
• Other Popular Options :
• Aspera
• AWS Import/Export
• WAN optimization
• AWS Direct Connect
[Diagram: Amazon S3, Logs]
Stage 2: Storage
• Amazon Web Services Building Block – S3
• Scalable Object Store
• Inherently Fault Tolerant
• ~2 TB of compressed logs every day
• S3 Reduced Redundancy (RR) option for intermediate outputs
• Amazon Glacier for archival
[Diagram: Social / 3rd Party Feeds, Cloud]
[Diagram: Amazon S3, Elastic MapReduce, Logs]
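The transfer and storage figures can be cross-checked with a little arithmetic. The sketch below assumes that the "~1 TB uncompressed every hour" and "~2 TB compressed every day" figures describe the same log stream, which the slides do not state explicitly; the function names are illustrative only.

```python
# Figures quoted on the slides (both approximate).
UNCOMPRESSED_TB_PER_HOUR = 1   # Stage 1: ~1 TB uncompressed logs every hour
COMPRESSED_TB_PER_DAY = 2      # Stage 2: ~2 TB of compressed logs every day


def daily_uncompressed_tb(tb_per_hour):
    """Uncompressed volume arriving in one day."""
    return tb_per_hour * 24


def compression_ratio(uncompressed_tb, compressed_tb):
    """How many uncompressed TB fit into one compressed TB."""
    return uncompressed_tb / compressed_tb


daily_tb = daily_uncompressed_tb(UNCOMPRESSED_TB_PER_HOUR)
# Assumption: both figures refer to the same logs.
ratio = compression_ratio(daily_tb, COMPRESSED_TB_PER_DAY)
print(f"{daily_tb} TB/day uncompressed, roughly {ratio:.0f}:1 compression")
```

If the assumption holds, compression alone cuts the stored volume by roughly an order of magnitude before any S3 storage-class choice is made.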
Stage 3: Analyze
• Amazon Elastic MapReduce (EMR) service
• Minimal Setup time
• Log Analysis
• ~2000 mappers / 750 reducers @ peak
• ~250 m1.xlarge task nodes (1000 cores, 3750 GB RAM) @ peak
[Diagram: Social / 3rd Party Feeds, Cloud]
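The peak cluster figures are internally consistent, which a short calculation confirms. This is only a sanity check on the slide's numbers; the per-node mapper and reducer counts assume that all map and reduce slots sit on the task nodes, which the slides do not say.

```python
# Peak figures quoted on the Stage 3 slide.
TASK_NODES = 250
TOTAL_CORES = 1000
TOTAL_RAM_GB = 3750
PEAK_MAPPERS = 2000
PEAK_REDUCERS = 750

# Per-node resources implied by the totals.
cores_per_node = TOTAL_CORES / TASK_NODES
ram_gb_per_node = TOTAL_RAM_GB / TASK_NODES

# Assumption: every mapper and reducer runs on a task node.
mappers_per_node = PEAK_MAPPERS / TASK_NODES
reducers_per_node = PEAK_REDUCERS / TASK_NODES

print(f"per node: {cores_per_node:.0f} cores, {ram_gb_per_node:.0f} GB RAM")
print(f"per node: {mappers_per_node:.0f} mappers, {reducers_per_node:.0f} reducers")
```

The implied 4 cores and 15 GB per node match the m1.xlarge instance type named on the slide.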
• Amazon EMR is great
• But adding Spot EC2 is super cool
Wait !!!
What is Amazon Spot ?
• Time-flexible, interruption-tolerant tasks
• Bid Price & Spot Price
• m1.xlarge price comparison
• $0.480 per hour – On-Demand
• $0.052 per hour – Spot
• You will never pay more than your maximum bid price per hour
• Spot Instances may be interrupted
• If interrupted, you will not be charged for any partial hour of usage (*free)
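To put the two m1.xlarge prices in perspective, here is a minimal sketch of the savings arithmetic. It uses the prices from the slide and the ~250-node peak from Stage 3; it treats the Spot price as a fixed snapshot (in practice it fluctuates) and ignores any EMR service fee, both of which are simplifying assumptions.

```python
from decimal import Decimal

# m1.xlarge prices per hour, as quoted on the slide.
ON_DEMAND = Decimal("0.480")
SPOT = Decimal("0.052")
PEAK_NODES = 250  # peak task-node count from the Stage 3 slide


def savings_percent(on_demand, spot):
    """Percentage saved by paying `spot` instead of `on_demand`."""
    return ((on_demand - spot) / on_demand * 100).quantize(Decimal("0.1"))


def hourly_cost(price_per_hour, nodes):
    """EC2 cost of running `nodes` instances for one hour."""
    return price_per_hour * nodes


print("savings:", savings_percent(ON_DEMAND, SPOT), "%")
print("on-demand per hour: $", hourly_cost(ON_DEMAND, PEAK_NODES))
print("spot per hour:      $", hourly_cost(SPOT, PEAK_NODES))
```

`Decimal` is used instead of floats so that the currency arithmetic stays exact.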
Spot Bidding Strategies
• Just above Spot Price
• Between Spot Price & On-Demand Price
• On-Demand Price
• Above On-Demand Price
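The four strategies differ only in where the bid sits relative to the current Spot price and the On-Demand price. The sketch below names the strategy for a given bid; the 10% `margin` that defines "just above" is a hypothetical cut-off, and `keeps_running` reflects the bid model described on the previous slide (an instance may be interrupted once the Spot price rises above the bid).

```python
def classify_bid(bid, spot_price, on_demand_price, margin=0.10):
    """Return the bidding strategy a bid corresponds to.

    `margin` is a hypothetical threshold for "just above" the Spot price.
    """
    if bid > on_demand_price:
        return "above on-demand price"
    if bid == on_demand_price:
        return "on-demand price"
    if bid <= spot_price:
        return "at or below spot price"
    if bid <= spot_price * (1 + margin):
        return "just above spot price"
    return "between spot and on-demand price"


def keeps_running(bid, spot_price):
    """True while the Spot price has not risen above the bid."""
    return spot_price <= bid


# m1.xlarge prices from the earlier slide: $0.052 Spot, $0.480 On-Demand.
for bid in (0.055, 0.200, 0.480, 0.600):
    print(bid, "->", classify_bid(bid, 0.052, 0.480))
```

A higher bid does not by itself raise the bill, since the slide notes you never pay more than your maximum bid; it mainly lowers the chance of interruption.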
Spot Price Variations - AZ
Amazon EMR with Spot Instance
Project | Master Instance Group | Core Instance Group | Task Instance Group
Long-running clusters | on-demand | on-demand | spot
Cost-driven workloads | spot | spot | spot
Data-critical workloads | on-demand | on-demand | spot
Application testing | spot | spot | spot
[Diagram: Amazon S3, Elastic MapReduce, Social / 3rd Party Feeds, Logs]
Stage 4: Custom EMR Manager
• We created a Custom EMR Manager
• Choose Spot based on:
  • Past price trend intelligence
  • Choose AZ based on current market prices
  • Choose between Large vs Extra Large
• Spot pricing strategy:
  • Set Spot bid price = On-Demand price
  • Go overboard by <20% of the On-Demand price at times
• Dynamic sizing of the Core / Task nodes
• Dynamic EMR cluster creation
[Diagram: Custom EMR Manager]
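The selection logic described above might look roughly like the sketch below. It is not the actual Custom EMR Manager, whose code the slides do not show: the availability-zone names, the Spot prices other than $0.052, and the m1.large On-Demand price are hypothetical sample data, and "past price trend intelligence" is left out, so only current prices are compared.

```python
from dataclasses import dataclass

# Core counts for the two instance sizes being compared.
CORES = {"m1.large": 2, "m1.xlarge": 4}


@dataclass(frozen=True)
class SpotPlan:
    availability_zone: str
    instance_type: str
    bid: float


def make_plan(spot_prices, on_demand_prices, overboard=0.0):
    """Pick the cheapest (AZ, instance type) per core and set the bid.

    spot_prices: {(az, instance_type): current Spot price per hour}
    on_demand_prices: {instance_type: On-Demand price per hour}
    overboard: fraction above On-Demand to bid; the slide allows < 20%.
    """
    if not 0.0 <= overboard < 0.20:
        raise ValueError("overboard must be in [0, 0.20)")
    az, itype = min(spot_prices, key=lambda k: spot_prices[k] / CORES[k[1]])
    bid = on_demand_prices[itype] * (1 + overboard)
    return SpotPlan(az, itype, round(bid, 3))


# Hypothetical market snapshot (only the 0.052 figure is from the slides).
spot_prices = {
    ("us-east-1a", "m1.xlarge"): 0.052,
    ("us-east-1b", "m1.xlarge"): 0.070,
    ("us-east-1a", "m1.large"): 0.030,
    ("us-east-1b", "m1.large"): 0.024,
}
on_demand_prices = {"m1.large": 0.240, "m1.xlarge": 0.480}  # m1.large assumed

plan = make_plan(spot_prices, on_demand_prices)
print(plan)
```

Comparing price per core rather than price per instance is one way to make the Large vs Extra Large choice; a real manager would also need to weigh memory and the price history.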
Some Spot Use Cases
• Analytics & Big Data
• Scientific computing
• Web crawling
• Financial model and Analysis
• Testing
• Image & Media Encoding
66 % savings
50 % savings
57 % savings
Learning
• Spot + On-Demand EC2 is a deadly combination for cost savings
• Every millisecond matters in MR – Tune your code
• Merge Files – Bigger ones are better for processing
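One simple way to merge small files is to group them, in order, into batches close to a target size and then concatenate each batch. The sketch below shows only the grouping step; the file names, sizes, and the 100 MB target are made up for illustration, and the slides do not say how the merging was actually done.

```python
def batch_small_files(files, target_mb):
    """Group (name, size_mb) pairs, in order, into batches of at most
    `target_mb` each. A single file larger than the target gets its own batch.
    """
    batches, current, current_size = [], [], 0
    for name, size_mb in files:
        # Close the current batch if this file would push it past the target.
        if current and current_size + size_mb > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size_mb
    if current:
        batches.append(current)
    return batches


# Hypothetical log files with sizes in MB.
files = [("a.log", 40), ("b.log", 50), ("c.log", 30), ("d.log", 90), ("e.log", 10)]
print(batch_small_files(files, target_mb=100))
# -> [['a.log', 'b.log'], ['c.log'], ['d.log', 'e.log']]
```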
More Learning …
• Custom Job Manager was designed by us
• 1 File Per Mapper was better for our case in AWS
• Understand the performance constraints of AWS and work with them
• Compress data: both in storage and in transit (LZO & Snappy)
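Log data tends to compress well because lines repeat the same structure. The sketch below illustrates the effect with Python's standard-library `zlib` as a stand-in; it is not LZO or Snappy, which need third-party libraries and trade compression ratio for speed differently. The sample log line is invented.

```python
import zlib

# An invented access-log line, repeated to mimic a highly regular log file.
line = b'203.0.113.7 - - [01/Jan/2013:00:00:00 +0000] "GET /ad?id=42 HTTP/1.1" 200 512\n'
raw = line * 10_000

compressed = zlib.compress(raw, 6)    # stand-in codec, not LZO/Snappy
restored = zlib.decompress(compressed)

print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
print("round trip intact:", restored == raw)
```

Real logs are less repetitive than this sample, so actual ratios will be lower.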
Continues…
• Keep configuration data in local memory or Amazon DynamoDB
• Reducers split output into files sized for the next job's mappers
• Elasticity – Increase/decrease Task nodes
• Elasticity – Create new EMR clusters matching the logs (Core + Task)
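Sizing the task group to the incoming log volume can be reduced to a small calculation. The sketch below is an assumption about how such sizing could work, not the method used in the project: `gb_per_node_hour` is a hypothetical throughput you would have to measure, and the 250-node cap simply reuses the peak figure from the Stage 3 slide.

```python
import math


def task_nodes_needed(input_gb, gb_per_node_hour, deadline_hours, max_nodes=250):
    """Task nodes needed to process `input_gb` within `deadline_hours`.

    gb_per_node_hour is a hypothetical, measured per-node throughput.
    The result is clamped to the range [1, max_nodes].
    """
    if input_gb < 0 or gb_per_node_hour <= 0 or deadline_hours <= 0:
        raise ValueError("inputs must be positive")
    needed = math.ceil(input_gb / (gb_per_node_hour * deadline_hours))
    return min(max(needed, 1), max_nodes)


# Hypothetical throughput of 5 GB per node per hour, one-hour deadline.
print(task_nodes_needed(1000, 5, 1))   # scale up for a large batch
print(task_nodes_needed(10, 5, 1))     # scale down for a small batch
print(task_nodes_needed(2000, 5, 1))   # capped at max_nodes
```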
Value
• ~56% cost savings compared with a pure On-Demand model for Core + Task nodes
• Automation vastly reduced labor cost (initial + ongoing)
• Customer CXOs were happy
About Us
• AWS Premium Partner
• Solution Experts in
• Cloud Computing
• Big Data
• Identity Management
Shoot your ?
Harish@8kmiles.com
http://harish11g.blogspot.com
@harish11g
harishganesan

Cloud Connect 2013- Lock Stock and x Smoking EC2's
