1TB/day 
Logging and counting billions of events. 
Scaling infrastructure using Amazon Web Services. 
Dirk Harms-Merbitz - grasswood@icloud.com
Amazon Web Services 
• Flexible toolkit for building Internet applications 
• Infrastructure as a service 
• Enables very fast growth 
• No commitments, capex replaced by opex
Example 
• Customer signs up on a web form and specifies the number of 
users and data retention policies, based on business needs. 
• Vendor programmatically spins up an instance from a 
custom AMI with EBS volumes or local storage RAIDed as 
needed to match performance, size, and cost parameters. 
• Whether there is one customer or one thousand, Amazon 
handles the infrastructure and the scaling of resources. 
• Vendor focuses on marketing, support, and software 
development.
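
The provisioning step in this example can be sketched in Python. This is a minimal sketch, not the vendor's actual code: it assumes a boto3-style EC2 client is passed in as `ec2`, and the function names, the even split across volumes, and the `/dev/sdf` onward device naming are illustrative choices. The 1TB and 10-volume ceilings are taken from the EBS slide.

```python
import math
import string

MAX_VOLUME_GB = 1024   # per-volume ceiling quoted on the EBS slide
MAX_VOLUMES = 10       # per-instance ceiling quoted on the EBS slide


def block_device_mappings(total_gb, volume_type="gp2"):
    """Split total_gb across equally sized EBS volumes (hypothetical sizing rule)."""
    count = max(1, math.ceil(total_gb / MAX_VOLUME_GB))
    if count > MAX_VOLUMES:
        raise ValueError("needs more volumes than one instance allows")
    size = math.ceil(total_gb / count)
    # Name the extra volumes /dev/sdf, /dev/sdg, ... (an assumed convention).
    letters = string.ascii_lowercase[5:5 + count]
    return [
        {
            "DeviceName": f"/dev/sd{letter}",
            "Ebs": {
                "VolumeSize": size,
                "VolumeType": volume_type,
                "DeleteOnTermination": True,
            },
        }
        for letter in letters
    ]


def launch_customer_instance(ec2, ami_id, instance_type, total_gb):
    """Start one instance from a custom AMI with its data volumes attached."""
    response = ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        BlockDeviceMappings=block_device_mappings(total_gb),
    )
    return response["Instances"][0]["InstanceId"]
```

With boto3 installed, `ec2` would typically be `boto3.client("ec2")`. The volumes still have to be combined with RAID or LVM inside the instance after boot.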
The AWS Toolkit 
• EC2 = Containers on Demand 
• EBS = Elastic Block Store 
• S3 = Object storage and static HTTP 
• Glacier = Long term storage
Elastic Compute Cloud (EC2) 
• Container for OS and application software 
• Storage is EBS or locally attached 
• / on EBS makes it easy to change instance size 
• Standard or custom AMI 
• An EC2 instance is not a server
Elastic Block Store (EBS) 
• More reliable than hard drives 
• Building blocks for application specific storage 
• Combine as needed using RAID and LVM 
• Different flavors: Provisioned IOPS (PIOPS), General Purpose SSD (GP2), magnetic 
• 1TB max per volume, 10 volumes max per instance, 1TB = $50-$388/mo 
• An EBS volume is not a disk
Local storage 
• Directly attached to an instance 
• Lower cost compared to EBS, much faster 
• Survives reboots but disappears when instance 
is stopped or terminated 
• Best used with instance level redundancy: 
RAID0 with the same data on multiple instances 
allows for very fast processing in parallel
Simple Storage Service (S3) 
• Stores objects of up to 5TB 
• 99.99% availability (four nines), 99.999999999% durability (eleven nines) 
• REST and SOAP interfaces - $5/1M requests 
• HTTP download, easy for customers to access 
• 1TB = $30/mo storage, $120/mo to transfer
AWS Glacier 
• Long-term archive storage 
• 99.99% availability (four nines), 99.999999999% durability (eleven nines) 
• $10/mo to store 1TB 
• Cost for getting data out is based on speed 
• Getting data out quickly can become expensive
AWS Optimizations 
• EBS-optimized instances offer better performance. Otherwise 
your storage and network traffic compete for the same bandwidth. 
• RAID and LVM are used to combine EBS volumes to 
match application storage size and throughput 
requirements. 
• Two local SSDs in RAID0 give double the size and speed of one. Data 
survives reboots, but it must be saved elsewhere (snapshots) before 
stopping or terminating the instance. 
• Cloud is not just AWS: there are many alternatives, such as 
DigitalOcean and Linode. EBS, however, makes resizing easy.
AWS Pros and Cons 
• Not hardware: Intuitions based on physical hardware won’t 
transfer. Everything is throttled. 
• Flexible: Used correctly, you don’t have to think about scaling 
your hardware to millions of users. Good for short-term use and testing ideas. 
• Complex: Easy to use incorrectly, with very low performance and 
very high costs possible as a result. 
• Expensive Mistakes: Storing 6TB for three years can cost as 
much as $83,808 or as little as $4,818. 
• If you know what you need, co-location delivers more for less: A 
physical 6TB drive is faster, lasts 3-5 years and costs $299.
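
The arithmetic behind the cost spread can be sketched from the per-TB prices quoted on the earlier slides. The upper figure of $83,808 matches 6TB x 36 months x $388. The slides do not state the basis of the $4,818 lower figure, and none of the per-TB prices below reproduce it, so it is not modeled here. The labels and the function name are illustrative.

```python
TERABYTES = 6
MONTHS = 36  # three years

# Monthly price per TB in USD, as quoted on the EBS, S3 and Glacier slides.
PRICE_PER_TB_MONTH = {
    "EBS, high end": 388,
    "EBS, low end": 50,
    "S3, storage only": 30,
    "Glacier, storage only": 10,
}


def three_year_cost(price_per_tb_month, terabytes=TERABYTES, months=MONTHS):
    """Total storage cost in USD, ignoring request and transfer charges."""
    return price_per_tb_month * terabytes * months


for name, price in PRICE_PER_TB_MONTH.items():
    print(f"{name:22} ${three_year_cost(price):>7,}")
```

Request, transfer, and Glacier retrieval charges come on top of these storage-only totals.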
AWS 
• Not appropriate for all businesses: Complexity 
cost, rental cost, slow technology updates. 
• Not appropriate for all applications: nobody 
mines bitcoin in AWS. 
• Not appropriate as a workaround when 
management is slow in approving hardware.
Tips & Tricks 
• avoid copying data 
• use parallel or pexec 
• speed up ssh, use mosh 
• use fixed length records 
• use raw block devices 
• use bitmaps
avoid copying data 
• write to EBS volume A until full 
• switch to volume B, continue writing 
• detach A and attach to processing instance 
• zero copy when a volume is passed around
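
The hand-off of a full volume can be sketched as follows. This is a minimal sketch under assumptions: `ec2` is a boto3-style EC2 client passed in by the caller, and the function name, argument names, and default device are hypothetical. The writer must unmount the volume first, and both instances must be in the same availability zone.

```python
def hand_off_volume(ec2, volume_id, writer_id, processor_id, device="/dev/sdf"):
    """Move a full EBS volume from the writing instance to a processing instance.

    No data is copied: the same volume is detached from one instance and
    attached to the other.
    """
    # Detach from the instance that filled the volume.
    ec2.detach_volume(VolumeId=volume_id, InstanceId=writer_id)
    # Block until the volume is free to be attached elsewhere.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    # Attach to the instance that will process the data.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=processor_id, Device=device)
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])
```

With boto3 installed, a call might look like `hand_off_volume(boto3.client("ec2"), volume_id, writer_id, processor_id)`, where the three IDs come from your own inventory.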
parallel and pexec 
• grep, bzip2, wc, awk, sed use only a single CPU core 
• GNU parallel or pexec make use of all cores, local and even on neighboring machines 
• pexec -o - -f instances -e x -c -- 'rsync -ae ssh /etc/hosts $x:/etc/hosts' 
• parallel ping -c1 ::: host1 host2 host3 host4 
• find . -name "*.csv.gz" -print | parallel zgrep "string" 
• find . -name "*.csv.gz" -print | parallel zcat > all.txt 
• cat all.txt | parallel --pipe grep 'api_key=xyz' 
• cat all.txt | parallel --pipe wc -l | awk '{s+=$1} END {print s}'
ssh and mosh 
• 30x faster when reusing ssh connections: 
• ControlMaster auto 
• ControlPersist yes 
• ControlPath ~/.ssh/socket-%r@%h:%p 
• mosh (mosh.mit.edu) works well over lossy connections 
• and keeps sessions alive across changing locations and IP addresses
fixed length records 
• Fixed length records on raw block devices 
• No compressing and uncompressing 
• No parsing of ASCII 
• No file system 
• No overflow possible, write pointer wraps
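
A fixed-length record log with a wrapping write pointer can be sketched like this. The 16-byte record layout, the class name, and the use of an in-memory `io.BytesIO` as a stand-in for the raw block device are all assumptions for illustration. The slides do not give the actual record format.

```python
import io
import struct

# Hypothetical 16-byte record: timestamp (u64), user id (u32),
# event code (u16), 2 padding bytes.
RECORD = struct.Struct("<QIH2x")


class RecordRing:
    """Fixed-length records written in place; the write pointer wraps."""

    def __init__(self, device, capacity):
        self.device = device      # any seekable binary file object
        self.capacity = capacity  # number of record slots on the device
        self.next_slot = 0

    def append(self, timestamp, user_id, event):
        # The slot offset is pure arithmetic: no file system, no parsing.
        self.device.seek(self.next_slot * RECORD.size)
        self.device.write(RECORD.pack(timestamp, user_id, event))
        # Wrap around instead of overflowing; the oldest record is overwritten.
        self.next_slot = (self.next_slot + 1) % self.capacity

    def read(self, slot):
        self.device.seek(slot * RECORD.size)
        return RECORD.unpack(self.device.read(RECORD.size))


# Stand-in device with room for four records.
ring = RecordRing(io.BytesIO(bytes(4 * RECORD.size)), capacity=4)
for i in range(5):
    ring.append(timestamp=1000 + i, user_id=i, event=7)
```

After five appends into four slots, slot 0 holds the fifth record and the write pointer sits at slot 1.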
raw block devices 
• Counters on raw block devices 
• By keeping just the lower byte of a counter in 
RAM you can divide access frequency by 256 
• RAID0 of SSDs can reach 1000-2000MB/s 
• EBS 100MB/s, RAID0 of multiple EBS 800MB/s
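
One reading of the lower-byte trick: keep the low 8 bits of each counter in RAM and touch the device only when that byte wraps, which happens once per 256 increments. The sketch below follows that interpretation. The class name, the 8-byte slot per counter, and the `io.BytesIO` stand-in for the raw device are assumptions.

```python
import io
import struct

SLOT = struct.Struct("<Q")  # one 8-byte slot per counter on the device


class SpilledCounters:
    """Counters whose lower byte lives in RAM and whose upper part lives on the device."""

    def __init__(self, device, count):
        self.device = device          # seekable binary file object, zero-filled
        self.low = bytearray(count)   # lower byte of every counter, in RAM
        self.device_writes = 0

    def _read_high(self, index):
        self.device.seek(index * SLOT.size)
        return SLOT.unpack(self.device.read(SLOT.size))[0]

    def increment(self, index):
        self.low[index] = (self.low[index] + 1) & 0xFF
        if self.low[index] == 0:
            # The RAM byte wrapped: carry one into the part stored on the device.
            high = self._read_high(index) + 1
            self.device.seek(index * SLOT.size)
            self.device.write(SLOT.pack(high))
            self.device_writes += 1

    def value(self, index):
        return self._read_high(index) * 256 + self.low[index]


counters = SpilledCounters(io.BytesIO(bytes(4 * SLOT.size)), count=4)
for _ in range(1000):
    counters.increment(2)
```

The trade-off in this sketch is that up to 255 counts per counter exist only in RAM and would be lost if the process dies before the next wrap.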
bitmaps 
• Bitmaps for counting things and other uses 
• 100M unique users in 12.5MB of RAM 
• Hourly, Daily, Weekly, Quarterly… 
• 6TB SSD instance = 7000 bits / person on earth
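
Counting unique users with a bitmap can be sketched as one bit per user ID. 100M bits divided by 8 is 12.5M bytes, which is where the 12.5MB figure comes from. The class and method names are illustrative, and the sketch assumes user IDs are dense integers starting at 0.

```python
# Number of set bits for every possible byte value, used as a translate table.
POPCOUNT = bytes(bin(value).count("1") for value in range(256))


class UserBitmap:
    """One bit per user ID; a set bit means the user was seen."""

    def __init__(self, max_users):
        self.bits = bytearray((max_users + 7) // 8)

    def add(self, user_id):
        self.bits[user_id >> 3] |= 1 << (user_id & 7)

    def __contains__(self, user_id):
        return bool(self.bits[user_id >> 3] & (1 << (user_id & 7)))

    def count(self):
        # Map each byte to its bit count, then add the results up.
        return sum(self.bits.translate(POPCOUNT))

    def union(self, other):
        """Combine two periods, e.g. hourly bitmaps into a daily one.

        Both bitmaps must have been created with the same max_users.
        """
        merged = UserBitmap(0)
        merged.bits = bytearray(a | b for a, b in zip(self.bits, other.bits))
        return merged


hour_1 = UserBitmap(1000)
hour_2 = UserBitmap(1000)
for user_id in (3, 10, 10):
    hour_1.add(user_id)
for user_id in (10, 99):
    hour_2.add(user_id)
day = hour_1.union(hour_2)
```

Repeated visits by the same user set the same bit, so `count()` returns unique users, and OR-ing hourly bitmaps gives the daily unique count without double counting.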

AWS Cloud experience concepts tips and tricks
