Using AWS To Build
A Scalable Machine Data Analytics Service
Christian Beedgen
November 13, 2013

Using AWS to Build a Scalable Big Data Management & Processing Service (BDT401) | AWS re:Invent 2013
By turning the data center into an API, AWS has enabled Sumo Logic to build a very large scale IT operational analytics platform as a service at unprecedented scale and velocity. Based around Amazon EC2 and Amazon S3, the Sumo Logic system is ingesting many terabytes of unstructured log data a day while at the same time delivering real-time dashboards and supporting hundreds of thousands of queries against the collected data. When co-founder and CTO Christian Beedgen started Sumo Logic, it was obvious that the service would have to scale quickly and elastically, and AWS has been providing the perfect infrastructure for this endeavor from the start.

In this talk, Christian dives into the core Sumo Logic architecture and explains which AWS services are making Sumo Logic possible. Based around an in-house developed automation and continuous deployment system, Sumo Logic is leveraging Amazon S3 in particular for large-scale data management and Amazon DynamoDB for cluster configuration management. By relying on automation, Sumo Logic is also able to perform sophisticated staging of new code for rapid deployment. Using the log-based instrumentation of the Sumo Logic codebase, Christian will dive into the performance characteristics achieved by the system today and share war stories about lessons learned along the way.


  1. 1. Using AWS To Build A Scalable Machine Data Analytics Service Christian Beedgen November 13, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  2. 2. Who Am I •  Co-Founder & CTO, Sumo Logic since 2010 –  Cloud-based Machine Data Analytics Service –  Applications, Operations, Security •  Server guy, Chief Architect, ArcSight, 2001-2009 –  Major SIEM player in the enterprise space –  Log Management for security & compliance
  3. 3. Everything You Know Is Wrong
  4. 4. Everything You Know Is Wrong
  5. 5. Agenda •  Introduction To Logs & Logging •  Why We Are Building This Service •  Architecture Of The Service •  Deployment Automation •  Loosely Coupled Components •  Lessons Learned •  Cost & Business Value
  6. 6. Introduction To Logs & Logging
  7. 7. What Is Machine Data? •  Actually, Machine Generated Data Curt Monash: “Data that was produced entirely by machines OR data that is more about observing humans than recording their choices.” Daniel Abadi: "Machine-generated data is data that is generated as a result of a decision of an independent computational agent or a measurement of an event that is not caused by a human action."
  8. 8. Examples Of Machine Data •  Computer, network, and other equipment logs •  Satellite and similar telemetry (espionage or science) •  Location data, RFID chip readings, GPS system output •  Temperature and other environmental sensor readings •  Sensor readings from factories, pipelines, etc. •  Output from many kinds of medical devices
  9. 9. What Are Logs? •  Logs are a kind of Machine Data •  Time-stamped bits and pieces of text •  Whispers & utterances of your infrastructure •  Written to disk to a log file by applications •  Sent over the network by devices
  10. 10. A Wealth Of Information •  Like Twitter for your infrastructure •  Machine data analytics… •  …is sentiment analysis for machines •  Free data of tremendous value •  Don’t forget to manage and analyze it
  11. 11. Or Else…
  12. 12. Anatomy Of A Log
  13. 13. Anatomy Of A Log •  Timestamp with time zone!
  14. 14. Anatomy Of A Log •  Timestamp with time zone! •  Log level
  15. 15. Anatomy Of A Log •  Timestamp with time zone! •  Log level •  Host ID & module name (process/service)
  16. 16. Anatomy Of A Log •  Timestamp with time zone! •  Log level •  Host ID & module name (process/service) •  Code location or class
  17. 17. Anatomy Of A Log •  Timestamp with time zone! •  Log level •  Host ID & module name (process/service) •  Code location or class •  Authentication context
  18. 18. Anatomy Of A Log •  Timestamp with time zone! •  Log level •  Host ID & module name (process/service) •  Code location or class •  Authentication context •  Key-value pairs
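The fields listed on this slide can be illustrated with a small formatter. This is only a sketch: the `LogEvent` type, the field order, the line layout, and the sample values (`ShardFlusher`, `customer-42`) are assumptions made for illustration, not a format shown in the talk.

```scala
import java.time.{ZoneOffset, ZonedDateTime}
import java.time.format.DateTimeFormatter

object LogAnatomy {
  // Hypothetical event type; the fields mirror the bullets on the slide.
  final case class LogEvent(
      timestamp: ZonedDateTime,      // timestamp with time zone
      level: String,                 // log level
      host: String,                  // host ID
      module: String,                // module name (process/service)
      location: String,              // code location or class
      authContext: String,           // authentication context
      fields: Seq[(String, String)]  // key-value pairs
  )

  // Render one line; ISO_OFFSET_DATE_TIME keeps the UTC offset in the output.
  def render(e: LogEvent): String = {
    val ts  = e.timestamp.format(DateTimeFormatter.ISO_OFFSET_DATE_TIME)
    val kvs = e.fields.map { case (k, v) => s"$k=$v" }.mkString(" ")
    s"$ts ${e.level} ${e.host} ${e.module} [${e.location}] auth=${e.authContext} $kvs"
  }

  // Sample values are made up for the example.
  val example: String = render(LogEvent(
    ZonedDateTime.of(2013, 11, 13, 9, 30, 15, 0, ZoneOffset.ofHours(-8)),
    "INFO", "prod-index-5", "index", "ShardFlusher", "customer-42",
    Seq("shard" -> "17", "bytes" -> "1048576")
  ))
}
```

Writing the offset into every line means a reader never has to guess which zone the producing host was in.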
  19. 19. Use Cases •  Availability & Performance –  Prevent downtime by proactive analytics, alerting –  Reduce MTTR by having all required data at your fingertips •  Application Release –  Derive metrics from development and staging systems pre-deploy –  Baselining and comparing post-deploy quickly shows errors •  Security & Compliance –  Compliance starts with having all security related logs in one place –  Analytics across all data facilitates detecting breaches and problems
  20. 20. Customer Metrics (use case: customer example and metric) •  Security & Compliance: Apigee reduced compliance audit costs by ~50% •  Availability and Performance: Ink saves nearly $500K annually •  Application Release: Intaact reduced errors by 4X
  21. 21. Machine Data Is Big Data •  Volume –  Machine Data is voluminous and will continue to grow –  Our own application easily creates 1 TB of logs per week •  Velocity –  Machine Data occurs in real-time, and it is time-stamped –  Needs to be processed in real-time as well •  Variety –  Machine Data is unstructured, or poly-structured at best –  Some standard schema, but sure enough not for your applications
  22. 22. Why We Are Building This Service
  23. 23. We Need To Evolve
  24. 24. We Need To Evolve
  25. 25. Legacy Products Fall Short •  Volume leads to scalability issues –  Every Log Management system will fail – I have seen it –  Why should you bother with scaling yet one more system? •  Velocity challenges processing pipelines –  What good are dashboards if they are not real-time? –  Streaming query engines are an absolute must •  Variety isn’t being embraced –  All data should be allowed into the system –  No vendor will ever know your application’s log schema
  26. 26. AWS Enables Innovation •  Attending Werner’s talk at Stanford in 2008 •  First parking lot discussion •  This can apply to our space! •  Datacenter as API •  Massive power up to scraggly devs
  27. 27. AWS Enables Sumo Logic •  Entering an existing market –  Existing & established competition, some of it huge –  Catch up & differentiate at the same time •  A Big Data service –  Scaling on premise is hard and leaves the hard part to the customer –  Now we build one single system to deal with all customers •  This data is important –  Regulatory compliance is among the big drivers for collecting it –  HA & DR concerns all over the place → Amazon S3
  28. 28. Deployment Architecture - Before
  29. 29. Deployment Architecture - After
  30. 30. Architecture Of The System
  31. 31. Development Approach •  Developed in Scala because we like it •  Many small cohesive modules, low coupling •  Maven-based build system •  Layers of modules combined into applications •  Different applications for different concerns •  Internal Service-Oriented Architecture •  Communication via documented protocols
  32. 32. Basic Concerns •  Data ingestion –  Receiving data –  Raw storage –  Full-text indexing •  Data analysis –  Interactive analytics –  Scheduled queries –  Machine learning –  Continuous query evaluation
  33. 33. Concerns Map To Clusters •  A cluster is multiple instances of the same application •  Deployed on multiple Amazon EC2 instances •  Deployed across multiple availability zones •  Instances within a cluster are oblivious of each other •  Receive from upstream, talk to downstream •  Receive from message bus, or talk RPC
  34. 34. Ingestion Path [diagram; component labels: Receiver, Bus, Raw, Index, CQ, S3]
  35. 35. Receiver •  HTTPS endpoint behind Elastic Load Balancing •  Decompress messages from Collector •  Extract timestamps from messages •  Aggregate messages per-customer into blocks •  Flush blocks to message bus •  Ack to Collector •  “Statelessly stateful”/“Statefully stateless”
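The "aggregate messages per-customer into blocks, flush blocks to message bus" steps can be sketched as follows. The `Message` and `Block` types, the size-based flush trigger, and the `flush` callback standing in for the message bus are all assumptions; the talk does not say how blocks are sized or when they are flushed.

```scala
import scala.collection.mutable

object ReceiverSketch {
  final case class Message(customer: String, text: String)
  final case class Block(customer: String, messages: Vector[String])

  // Buffers messages per customer and emits a block once maxBlockSize is reached.
  // `flush` is a hypothetical stand-in for publishing to the message bus.
  final class BlockAggregator(maxBlockSize: Int, flush: Block => Unit) {
    private val pending = mutable.Map.empty[String, Vector[String]]

    def add(m: Message): Unit = {
      val buffered = pending.getOrElse(m.customer, Vector.empty[String]) :+ m.text
      if (buffered.size >= maxBlockSize) {
        pending.remove(m.customer)
        flush(Block(m.customer, buffered))
      } else {
        pending.update(m.customer, buffered)
      }
    }

    // Flush whatever is left, for example on a timer or at shutdown.
    def flushAll(): Unit = {
      val blocks = pending.toList.map { case (c, msgs) => Block(c, msgs) }
      pending.clear()
      blocks.foreach(flush)
    }
  }
}
```

In this sketch an ack to the Collector would only be sent after `flush` returns, so buffered messages that are lost with the process are simply re-sent.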
  36. 36. Raw •  Receive message blocks from message bus •  Encrypt message blocks •  Different key for every day for every customer •  Flush encrypted message blocks to Amazon S3 •  Copy blocks as CSV to customer’s Amazon S3 bucket •  Ack to message bus •  Fully stateless
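"Different key for every day for every customer" amounts to keying the key lookup by a (customer, day) pair. A minimal sketch, assuming AES-128 keys and an in-memory map as a stand-in for a real key store; neither the algorithm nor the storage is stated in the talk.

```scala
import java.time.LocalDate
import javax.crypto.{KeyGenerator, SecretKey}
import scala.collection.mutable

object DailyKeys {
  // In-memory stand-in for a key store (assumption, for illustration only).
  private val keys = mutable.Map.empty[(String, LocalDate), SecretKey]

  private def newAesKey(): SecretKey = {
    val gen = KeyGenerator.getInstance("AES")
    gen.init(128)
    gen.generateKey()
  }

  // Returns the same key for the same customer and day, a fresh one otherwise.
  def keyFor(customer: String, day: LocalDate): SecretKey = synchronized {
    keys.getOrElseUpdate((customer, day), newAesKey())
  }
}
```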
  37. 37. Index •  Receive message blocks from message bus •  Cache message block on disk and ack to message bus •  Add message blocks to Lucene indexes •  Deal with wildly varying timestamps •  Flush index shards to Amazon S3 •  Update metadata database with index shard info •  Stateful
  38. 38. Continuous Query •  Receive message blocks from message bus •  Evaluate each message against all search expressions •  Push matching messages into respective pipelines •  Ack to message bus •  Flush results periodically for pickup by client •  Persist checkpoints periodically to Amazon S3 •  Stateful, with checkpoint recovery
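The core loop, "evaluate each message against all search expressions, push matching messages into respective pipelines", might look like the sketch below. Modelling a search expression as a named predicate and a pipeline as a vector of matched messages are assumptions for illustration.

```scala
object ContinuousQuerySketch {
  // A search expression is modelled as a named predicate over the message text.
  final case class Search(name: String, matches: String => Boolean)

  // For every incoming message, test all searches and append the message to the
  // pipeline of each search that matched. Arrival order is preserved.
  def route(messages: Seq[String], searches: Seq[Search]): Map[String, Vector[String]] = {
    val empty = searches.map(s => s.name -> Vector.empty[String]).toMap
    messages.foldLeft(empty) { (pipelines, msg) =>
      searches.filter(_.matches(msg)).foldLeft(pipelines) { (acc, s) =>
        acc.updated(s.name, acc(s.name) :+ msg)
      }
    }
  }
}
```

One message can land in several pipelines, which is why the cost of this stage grows with the number of registered searches.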
  39. 39. Analytics Path [diagram; component labels: Query, Service, S3, CQ]
  40. 40. Query •  Fully distributed streaming query engine •  Materialize messages matching search expression •  Push messages through a pipeline of operators •  First stage – non-aggregation operators •  Second stage – aggregation operators •  Present both raw message results as well as aggregates •  Results update periodically for interactive UI experience
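The two-stage operator pipeline can be sketched in a few lines. The `Operator` type, the `where` filter, and the "count by host" aggregation are hypothetical examples, and this single-process version ignores distribution and streaming.

```scala
object QueryPipelineSketch {
  final case class Message(host: String, text: String)

  // Stage 1: non-aggregation operators transform or drop individual messages.
  type Operator = Message => Option[Message]

  def where(p: Message => Boolean): Operator = m => if (p(m)) Some(m) else None

  def runStage1(ops: Seq[Operator], input: Seq[Message]): Seq[Message] =
    input.flatMap { m =>
      ops.foldLeft(Option(m)) { (cur, op) => cur.flatMap(op) }
    }

  // Stage 2: an aggregation operator, here "count by host".
  def countByHost(ms: Seq[Message]): Map[String, Int] =
    ms.groupBy(_.host).map { case (h, group) => h -> group.size }

  // Returns both the raw matches and the aggregate, as described on the slide.
  def run(ops: Seq[Operator], input: Seq[Message]): (Seq[Message], Map[String, Int]) = {
    val raw = runStage1(ops, input)
    (raw, countByHost(raw))
  }
}
```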
  41. 41. Deployment Automation
  42. 42. Why Deployment Automation •  Add 1 part developers, 1 part Datacenter-as-API, stir… •  Aim for fully integrated continuous deployment •  Checkin → unit test → integration test → deployment •  Jenkins automates it all – using AWS instances •  Deployment doesn’t mean production •  Nite → Stag → Long → Prod deployments •  There are humans involved as well!
  43. 43. Automation Enables Scale •  The goal is 100% - accept no less •  Why you need automation –  Number of deployments grows (staging, per-developer) –  Number of AWS resources per deployment grows –  Number of operators/developers grows –  Frequency of deployments, changes increases
  44. 44. Current Deployment Stats •  4 Deployments running 24/7, 50 for development •  20+ clusters per deployment •  25+ software components deployed •  Hundreds of instances in production •  Less than 10 minutes to deploy from scratch •  Less than 4 minutes to restart hundreds of components
  45. 45. dsh: Another AWS deployment tool •  Model-driven, describe desired state, run to make it so •  High performance due to parallelization •  Covers all layers of the stack – AWS, OS, Sumo Logic •  Easy to use and extend, scriptable CLI •  Developer-friendly, Scala-based, high-level APIs
  46. 46. Example session
  47. 47. Sie Ist Ein Model & Sie Sieht Gut Aus (“She is a model and she looks good”) •  Model contains concepts –  Deployment –  Cluster –  AWS Resources (Amazon S3, Amazon Elastic Load Balancing, Amazon DynamoDB, Amazon RDS, etc.) –  Software assemblies –  AWS configuration (IAM users, security groups, etc.) •  Human-readable names: prod-index-5
  48. 48. Model Snippet
  49. 49. Model Snippet
  50. 50. Differential Deployment •  Start by finding existing resources –  Use tagging where it is available –  Name prefixes (“prod_xxx”) where it isn’t (security groups, IAM, …) •  Fix differences to model –  Start “missing” instances –  Change security group rules, missing IAM users •  Proceed with caution –  Never delete anything that holds data –  Amazon EBS, Amazon DynamoDB, Amazon S3, Amazon RDS
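The "find existing resources, fix differences to model" step is essentially a set difference between the desired and the observed state. A minimal sketch, assuming resources are identified by their logical names; the `Plan` type is hypothetical and is not taken from dsh.

```scala
object DifferentialDeploymentSketch {
  // Names are logical instance names such as "prod-index-5".
  final case class Plan(toStart: Set[String], unexpected: Set[String])

  // Instances in the model but not running get started. Instances running but
  // not in the model are only reported, in the spirit of "proceed with caution".
  def plan(model: Set[String], running: Set[String]): Plan =
    Plan(toStart = model.diff(running), unexpected = running.diff(model))
}
```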
  51. 51. Example Of Tag Usage
  52. 52. Making It Fast •  Parallelize all the things –  Upload to Amazon S3 while booting instances while creating IAM users while setting up security groups while… –  Hyper-concurrent rolling restarts
  53. 53. Hyper, Hyper
  54. 54. Making It Fast •  Parallelize all the things –  Upload to Amazon S3 while booting instances while creating IAM users while setting up security groups while… –  Hyper-concurrent rolling restarts •  Fast enough for development –  Write new code or fix a bug, compile locally –  Push code to development deployment and make it live •  Optimize data transfers –  Use Amazon S3 hashes to only transfer new files –  Only upload changed JARs
  55. 55. Making It Reliable •  Check prerequisites before you even try –  Does Prod account have room for this many instances? –  Do I have the required permissions for the AWS APIs? –  Any model discrepancies I can’t automatically resolve? Too many Amazon EBS volumes? •  Handle common failures automatically –  No m1.large in us-east-1b? Move Amazon EBS volumes to us-west-1c and try there –  Hitting the AWS API rate limit? Throttle and try again –  SSH didn’t come up on the instance? Kill it and launch another –  Eventual consistency in AWS – query until it has the expected state (tags)
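"Throttle and try again" and "query until it has the expected state" are both small retry loops. The sketch below uses exponential backoff for the first and fixed-interval polling for the second; the function names, attempt counts, and pause lengths are assumptions, and no real AWS call is made.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object RetrySketch {
  // Retries `call` up to maxAttempts times, doubling the pause after each failure.
  @tailrec
  def withBackoff[A](maxAttempts: Int, pauseMillis: Long)(call: () => A): A =
    Try(call()) match {
      case Success(a) => a
      case Failure(e) if maxAttempts <= 1 => throw e
      case Failure(_) =>
        Thread.sleep(pauseMillis)
        withBackoff(maxAttempts - 1, pauseMillis * 2)(call)
    }

  // Re-reads a value until it satisfies `ok`, to ride out eventual consistency.
  @tailrec
  def pollUntil[A](attempts: Int, pauseMillis: Long)(read: () => A)(ok: A => Boolean): Option[A] = {
    val v = read()
    if (ok(v)) Some(v)
    else if (attempts <= 1) None
    else {
      Thread.sleep(pauseMillis)
      pollUntil(attempts - 1, pauseMillis)(read)(ok)
    }
  }
}
```

Bounding the number of attempts matters: an unbounded retry against a rate-limited API just prolongs the throttling.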
  56. 56. Making It Secure •  Different AWS accounts –  Per developer –  Production •  account.xml –  All credentials for one AWS account (AWS keys, SSH keys) –  Password-protected •  IAM –  One user per Sumo component –  Minimal IAM policy –  Inject AWS credentials •  Security Groups –  Part of the model –  Minimal privileges
  57. 57. Making It Safe •  Let mistakes happen at most once •  Add safeguards to prevent operator mistakes •  Type in the deployment name before deleting anything •  Disallow risky operations in production (shutdown Prod) •  Don’t allow -SNAPSHOT code to be deployed in production
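Two of these safeguards are simple enough to show directly. This is a sketch only: the function names and the convention that the production deployment is called "prod" are assumptions.

```scala
object SafeguardSketch {
  // Deleting requires the operator to type the deployment name exactly.
  def confirmDelete(deployment: String, typed: String): Boolean =
    typed == deployment

  // Reject -SNAPSHOT artifact versions for the production deployment.
  def canDeploy(deployment: String, version: String): Either[String, Unit] =
    if (deployment == "prod" && version.endsWith("-SNAPSHOT"))
      Left(s"refusing to deploy $version to $deployment")
    else Right(())
}
```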
  58. 58. Making It Easy •  Automate best practices –  Distribute instances over availability zones evenly –  Register instances in Elastic Load Balancing and match AZs to instances –  Tag all resources consistently •  Consistent naming –  Generate SSH with logical names
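"Distribute instances over availability zones evenly" can be done with a round-robin assignment. A minimal sketch, assuming instances and zones are plain name strings; the zone names in the usage line are only examples.

```scala
object ZoneSpreadSketch {
  // Assign instances to zones round-robin so zone sizes differ by at most one.
  // Zones that receive no instance do not appear in the result.
  def spread(instances: Seq[String], zones: Seq[String]): Map[String, Seq[String]] = {
    require(zones.nonEmpty, "need at least one availability zone")
    instances.zipWithIndex
      .groupBy { case (_, i) => zones(i % zones.size) }
      .map { case (zone, members) => zone -> members.map(_._1) }
  }
}
```

For example, `ZoneSpreadSketch.spread(Seq("i1", "i2", "i3"), Seq("us-east-1a", "us-east-1b"))` places two instances in the first zone and one in the second.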
  59. 59. Making It Affordable •  Developers forget to shut stuff down –  Deployment reaper automatically shuts down deployments –  Daily cost emails •  Per-team budgets –  Manager responsible to keep within budget
  60. 60. Pitfalls •  Base AMI plus scripted installation prevents auto scaling •  Security group updates cause TCP disconnects •  This is fixed in the VPC stack, however •  Parallelism can cause stampedes (for example, Amazon DynamoDB) •  Tagging API rate limits are easy to hit
  61. 61. Loosely Coupled Components
  62. 62. Loose Coupling In The Large •  A deployment is made up of many things •  Some of these things need to talk to each other •  Some of these things come and go •  Don’t pass in a huge list of static dependencies •  Start each application with one parameter: $ bin/receiver prod.service-registry.sumologic.com
  63. 63. Service Registry •  Service Registry is a concept, enables discovery •  A client-side library accessing a ZooKeeper cluster •  Services are abstracted into types •  Application provides and consumes different services •  Sumo Logic services (RPC) •  Third-party services (message bus) •  AWS services (Amazon ElastiCache, Amazon RDS)
  64. 64. The Perils Of Horizontal Scale •  Scaling out a multi-tenant processing system •  1000s of customers, 1000s of machines •  Parallelism is good, but locality has to be considered •  1 customer distributed over 1000 machines is bad •  No single machine getting enough load for that customer •  Batches & shards will become too small •  Metadata and in-memory structures grow out of proportion
  65. 65. The Perils Of Horizontal Scale [diagram: 25 Index nodes]
  66. 66. The Perils Of Horizontal Scale [diagram: 25 Index nodes, label 1 on every node]
  67. 67. The Perils Of Horizontal Scale [diagram: 25 Index nodes, labels 1 and 2 on every node]
  68. 68. The Perils Of Horizontal Scale [diagram: 25 Index nodes, labels 1 through 8 on every node]
  69. 69. The Perils Of Horizontal Scale [diagram: 25 Index nodes, label 1 on only 4 of them]
  70. 70. The Perils Of Horizontal Scale [diagram: 25 Index nodes, label 1 on 4 nodes and label 2 on 6 nodes]
  71. 71. The Perils Of Horizontal Scale [diagram: Index nodes, labels 1 through 8 each on a small subset of nodes]
  72. 72. Customer Partitioning •  Each cluster elects a leader node via ZooKeeper •  Leader runs the partitioning logic •  Set[Customer], Set[Instance] → Map[Instance, Set[Customer]] •  Partitioning written to ZooKeeper •  Example: indexer node knows which customer’s message blocks to pull from message bus
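The signature on this slide can be given one possible body. The talk does not describe the actual partitioning logic, so the placement rule below (each customer on a fixed number of consecutive instances in sorted order) and the extra `replicas` parameter are assumptions chosen to keep the example deterministic.

```scala
object PartitioningSketch {
  type Customer = String
  type Instance = String

  // Each customer is placed on `replicas` instances, chosen by walking the
  // sorted instance list from a customer-specific offset.
  def partition(customers: Set[Customer], instances: Set[Instance], replicas: Int): Map[Instance, Set[Customer]] = {
    require(instances.nonEmpty && replicas >= 1)
    val ring = instances.toVector.sorted
    val k = math.min(replicas, ring.size)
    val placements = for {
      (c, idx) <- customers.toVector.sorted.zipWithIndex
      r <- 0 until k
    } yield ring((idx * k + r) % ring.size) -> c
    val empty = ring.map(i => i -> Set.empty[Customer]).toMap
    placements.foldLeft(empty) { case (acc, (i, c)) => acc.updated(i, acc(i) + c) }
  }
}
```

Because the function is pure and deterministic, any node elected leader computes the same map from the same inputs, which is convenient when the result is published for the other nodes to read.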
  73. 73. Lessons Learned
  74. 74. Some Tips On AWS S3 •  Use the TransferManager class from the AWS Java SDK –  Multi-part uploads and downloads –  Multi-threaded, overall latency reduction •  Use random prefixes for keynames in Amazon S3 buckets –  Amazon S3 partitions by keyname prefix (see http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html) •  Endpoint URL for Amazon S3 –  s3.amazonaws.com might go to Virginia, or Pacific Northwest (!) –  If you are in us-east, use s3-external-1.amazonaws.com instead
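One way to get well-spread key prefixes is to derive them from a hash of the logical key. Note that this swaps the slide's "random" prefix for a hash-derived one so that the full key can be recomputed later; the MD5 choice, the 4-character prefix length, and the key layout are assumptions.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object S3KeySketch {
  // Prepend the first 4 hex characters of the MD5 of the logical key.
  def prefixed(logicalKey: String): String = {
    val digest = MessageDigest.getInstance("MD5").digest(logicalKey.getBytes(StandardCharsets.UTF_8))
    val hex = digest.take(2).map(b => f"${b & 0xff}%02x").mkString
    s"$hex/$logicalKey"
  }
}
```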
  75. 75. Elastic Block Store •  RAID-0 makes Amazon EBS faster –  Use LVM RAID-0 if heavy I/O is required –  Align stripe sizes with file system block sizes •  Snapshotting Amazon EBS volumes –  Snapshots eat performance –  Even for volumes with provisioned IOPS •  Overlapping snapshots –  Can be scheduled too close together, like every minute –  I/Os start taking 30+ seconds
  76. 76. Cost & Business Value
  77. 77. Somebody Has To Pay For Lunch •  On-demand resources are very sexy •  Automation gives developers their own sandbox •  Compute is the most easily incurred cost •  You need an automated reaper •  Or just raise another round… :-)
  78. 78. Elasticity Is Not An Arbitrary Need •  At least in our system, there’s baseline load •  At least in our system, the cost is in compute •  Alert-based scaling can be safe & effective •  Measure your spend with tools that are out there •  We actually use Sumo Logic for that! •  Look for a moving average of resource consumption •  Buy Reserved Instances, don’t fret the instance types
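"Look for a moving average of resource consumption" can be computed with a sliding window. A minimal sketch, assuming evenly spaced samples (for example, instance-hours per day); the window length is a free parameter and not something the talk specifies.

```scala
object MovingAverageSketch {
  // Simple moving average over a fixed window; emits one value per full window.
  def movingAverage(samples: Seq[Double], window: Int): Seq[Double] = {
    require(window >= 1, "window must be positive")
    samples.sliding(window).filter(_.size == window).map(_.sum / window).toSeq
  }
}
```

The smoothed series gives a steadier view of the baseline load than individual samples, which is the number a Reserved Instance purchase would be sized against.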
  79. 79. One More Thing
  80. 80. Amazon CloudTrail •  Logs! From AWS! The eagle has landed! •  Amazon CloudTrail logs your API activity to Amazon S3 •  Sumo Logic will read from Amazon S3, allow analysis
  81. 81. Please give us your feedback on this presentation BDT401 As a thank you, we will select prize winners daily for completed surveys! Thank You