What if you were told that within three months, you had to scale your existing platform from 1,000 req/sec (requests per second) to 300,000 req/sec with an average latency of 25 milliseconds? And that you had to accomplish this on a tight budget, expand globally, and keep the project confidential until it was officially announced by well-known global mobile device manufacturers? That’s exactly what happened to us. This session explains how The Weather Company partnered with AWS to scale our data distribution platform to prepare for unpredictable global demand. We cover the many challenges that we faced as we worked on architecture design, technology and tool selection, load testing, deployment, and monitoring, and how we solved these challenges using AWS.
2. What to Expect from the Session
Building a Big Data Distribution Platform:
- Goals
- Architecture
- Logical and Physical Components
- Data Supply Chain, from Ingest to Distribution
- Journey
- Building, Tuning and Scaling the Platform
- AWS Insights
- Evolution of the Architecture
Audience:
- Engineering Leaders
- Architects
6. Background: The Weather Company
We power weather for
Apple, Facebook,
Google, Microsoft,
Twitter, Yahoo and
many more
Our B2B Division, WSI,
has 4,600+ B2B clients
in 60 countries.
WHERE THE WORLD GETS ITS WEATHER
#1 MOST DISTRIBUTED
Cable Network
170M+ App Downloads
47.2M Unduplicated Monthly
Uniques
124M+
Monthly Unique
72% visit 2x or more Daily
7. Background: A Data Company
Data
Network of
100K+ weather
sensors
Global Lightning
Detection
Network
Global Radar &
Location Data
Largest
Collection of
Weather Data
State-of-the-Science
Forecasts
Technologies
Industry Best
Forecast Modeling
Proprietary
Radar
Algorithms
Proprietary
Weather
Analytics
220+ Full-Time
Meteorologists
TWC Content (Video,
Images, Articles)
Weather APIs
Content APIs
20+ TB Data Daily
800+ Sources of
Ingest
40+ Billion API
Requests Daily
8. Background: About Data
Weather Data
- Observations
- Forecasts
- Radar
- Alerts
- Notices
- Emergency Bulletins
- Health & Lifestyle
Content
- Articles
- Images
- Slide Shows
- Videos
- Maps
Domain Specific
- Aviation
- Energy
- Insurance
9. Background: Big Data
- Push/Pull, every 5 minutes
- Real-Time Alerts & Notifications
- World’s most volatile atmospheric data
- 15-20 sec. to prepare and serve
- 800+ Partners
- 50+ GB Raw compressed data
- Several Billion Requests / day
Big Data
Variety
Volume
Velocity
Textual data, structured, unstructured, binary data, pictures, images, videos
10. Background: About Distribution
Digital
- Weather.com,
Wunderground.com
- Mobile Apps on all Major
Mobile OS Platforms
Partnerships
- Major Mobile Phone
Company
- Major Search Engine
- Many Others …
B2B
- Major Airlines
- Energy Trading Desks
- Many Others …
40+ Billion API Requests / day
Expect 60 Billion / day by EOY 2015
11. The Dark Ages: Before The Cloud
- Run From TWC Data Centers
- Slow Time To Market
- Product
- Content
- Limited Distributed Scaling
- Limits of our existing Data
Centers
- Batch Based Forecast Systems
- Java Based Monolithic
Applications
- Big Web, Mobile Web
- Data Services
- Homegrown CMS
12. Reboot & Reimagine: Goals
Business
- Build a Low Latency Global On Demand
Forecasting System
- Build a Highly Scalable Global Data
Distribution Platform
- Reboot Digital Properties (weather.com,
Mobile Apps, CMS)
- Reduce time to deploy new data sets
- Data Distribution APIs as Product
- Secured/Metered access to APIs
- Consolidate Data Centers
Technical
- 100% cloud based
- Capable of handling billions of requests a day
- Capable of ingesting & processing Terabytes
of data a day
- Low latency APIs (25-100 ms)
- Highly Scalable
- Highly Available (99.99%)
- Generic Data Processing Engine (DPE)
- Developer Friendly APIs
- Authentication, metering, and throttling
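As a rough worked example of what the availability goal above implies, the sketch below converts an availability target into a downtime budget. It assumes the 99.99 figure is a percentage and that it is measured over a 365-day year; the slides do not state the measurement window, and the function name is hypothetical.

```python
# Downtime budget implied by an availability target.
# Assumption: the target is a percentage measured over a fixed window of days.

def downtime_budget_minutes(availability_pct: float, days: int = 365) -> float:
    """Return the allowed downtime, in minutes, for the given window."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Budget for a 365-day year and for a 30-day month.
    print(round(downtime_budget_minutes(99.99), 1))
    print(round(downtime_budget_minutes(99.99, days=30), 2))
```

Under these assumptions the budget is a little under 53 minutes per year, which is one reason the later slides stress destructive testing across instances, AZs, and regions.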
14. Architecture: Component Layers
- Large Undertaking – Divide & Conquer
- Loosely Coupled Layered Architecture
- Focus on your Core Competency
- Best Tool/Technology for the job
- Independent Delivery Timelines
- DATA PLATFORM: Weather Data
Distribution As A Service
- Eat your own dog food!
Component layers (diagram): Data Processing Engine, Data Services, Storage, Systems of Record, Gateway, CDN
15. Architecture: Data Processing Engine (DPE)
- Generic DPE
- API Driven
- Data Agnostic
- Extensible
- Always on, Always flowing
- Asynchronous, Non Blocking
- High availability
- Low latency
- Horizontal scalability
16. Architecture: Data Processing Engine (DPE)
Diagram components: Push/Pull Data Providers; IAPI; RabbitMQ; DPE (DPE Core with Plugin 1, Plugin 2, Plugin 3); Redis; Riak; S3; RabbitMQ; System of Record (e.g. Forecast On Demand)
- DPE Architecture
- DPE Core
- Custom Plugins for Process, Download,
Store, Archive
- Technical Stack
- Java 1.7
- Storage (Redis)
- Archive (Riak, S3)
- Distribution – RabbitMQ
- OS: Amazon Linux (CentOS 6 variant)
- Ingestion API
- RESTful Web Service
- Messaging Queue
- RabbitMQ Cluster
- Workers
- DPE
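The core-plus-plugins idea on this slide can be illustrated with a minimal sketch. The production DPE is described as Java 1.7; Python is used here only for brevity, and every name (DpeCore, download, process, make_store) and the dict-based message shape are assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, Dict, List

# A message is a plain dict here; the real DPE message format is not given in the slides.
Message = Dict[str, object]
Plugin = Callable[[Message], Message]


class DpeCore:
    """Hypothetical core that runs each message through registered plugins in order."""

    def __init__(self) -> None:
        self._plugins: List[Plugin] = []

    def register(self, plugin: Plugin) -> None:
        self._plugins.append(plugin)

    def handle(self, message: Message) -> Message:
        for plugin in self._plugins:
            message = plugin(message)
        return message


def download(message: Message) -> Message:
    # Stand-in for fetching a payload from a data provider.
    return {**message, "payload": "raw:" + str(message["source"])}


def process(message: Message) -> Message:
    # Stand-in for a data-specific transformation.
    return {**message, "payload": str(message["payload"]).upper()}


def make_store(store: Dict[str, object]) -> Plugin:
    # Returns a plugin that writes the payload to the given store (a dict stands in for Redis).
    def store_plugin(message: Message) -> Message:
        store[str(message["source"])] = message["payload"]
        return message

    return store_plugin


if __name__ == "__main__":
    store: Dict[str, object] = {}
    core = DpeCore()
    for plugin in (download, process, make_store(store)):
        core.register(plugin)
    core.handle({"source": "observations"})
    print(store)
```

Keeping the core unaware of what each plugin does is what lets the engine stay data agnostic: a new data set means a new plugin, not a change to the core.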
17. Architecture: Data Flow (DPE)
Diagram: IAPI Endpoint, RabbitMQ Cluster, Data Processing Engine, and Data Publisher, deployed across public and private subnets in AZ A and AZ B
18. Architecture: Storage
- Polyglot Architecture
- Best Store for the Job
- Most Cost Effective
Storage for the Job
- BYOS: Bring Your Own Store
- Cache Rich!
19. Architecture: Storage Polyglot
- Archive
- Images
- Videos
Bucket
Key/Value
Master
Slaves
- Real-time Data
and Caching
Key/Value
Node
Node
Node
Node
Key/Value
- Historical Weather
Archive
- Data Migration
- Gateway Data
- Analytics
Node
Node
Node
Node
Columnar
- Analytics
Parquet
Columnar
Storage
Repositories
MySQL
SQL
Server
- Informatica
- Drupal
20. Architecture: Cache is your friend!
CDN
Master
Slaves
- App Cache
Key/Value
(with data types
for values)
- Origin Cache
- Edge Caching
- Edge Compute
- Make Sure All Data Elements are TTL Driven
- Always Respect Cache Control Headers
Varnish
EC2 EC2
App Instances
EC2 EC2
- And Keep It Simple!
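A minimal sketch of a TTL-driven cache that takes its TTL from the Cache-Control header may make the two rules above concrete. It is an illustration only: the class and function names are hypothetical, the header handling covers just max-age, no-store, and no-cache, and treating no-cache as "do not store" is a simplification of the HTTP caching rules.

```python
import re
import time
from typing import Callable, Dict, Optional, Tuple


def max_age_seconds(cache_control: str) -> Optional[int]:
    """Return the TTL implied by a Cache-Control header, or None if not cacheable here."""
    directives = [d.strip().lower() for d in cache_control.split(",")]
    # Simplification: no-cache is treated like no-store instead of store-and-revalidate.
    if "no-store" in directives or "no-cache" in directives:
        return None
    for directive in directives:
        match = re.fullmatch(r"max-age=(\d+)", directive)
        if match:
            return int(match.group(1))
    return None


class TtlCache:
    """Every entry carries an expiry time; nothing is cached without a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def put(self, key: str, value: bytes, cache_control: str) -> bool:
        ttl = max_age_seconds(cache_control)
        if ttl is None or ttl == 0:
            return False  # respect the header: do not store
        self._entries[key] = (self._clock() + ttl, value)
        return True

    def get(self, key: str) -> Optional[bytes]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: behave as a miss
            return None
        return value
```

The clock is injectable so that expiry can be tested without sleeping; in a real deployment the same TTL would typically also drive the CDN and Varnish layers.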
21. Architecture: Systems Of Record
- Let the system designers focus on the
problem they are trying to solve
- Let them pick the best technology
- Just Make sure they interface using
standard protocols
- Let DPE handle Ingest
- Let Services Layer handle
Distribution
- Support both Push/Pull model for
publication to distribution engine
22. Architecture: Systems of Record
Forecast On Demand CMS
GET Model Post Model
Forecast On Demand
Data Services Data Services
Content Management system
Get: On Cache Miss Post: On Publish
RESTful Endpoint
Currents On Demand
GET Model
Currents On Demand
Data Services
Get: On Cache Miss
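The two publication models on this slide (GET on cache miss, POST on publish) can be sketched as follows. This is a hedged illustration: the DataService class, its method names, and the in-memory dict cache are assumptions, and TTL handling is left out to keep the example short.

```python
from typing import Callable, Dict


class DataService:
    """Hypothetical data service sitting between API callers and a system of record."""

    def __init__(self, fetch_from_system_of_record: Callable[[str], str]) -> None:
        self._fetch = fetch_from_system_of_record
        self._cache: Dict[str, str] = {}
        self.origin_calls = 0  # counts pulls from the system of record

    def get(self, key: str) -> str:
        # GET model (e.g. Forecast On Demand): pull from the system of record only on a miss.
        if key not in self._cache:
            self.origin_calls += 1
            self._cache[key] = self._fetch(key)
        return self._cache[key]

    def publish(self, key: str, value: str) -> None:
        # POST model (e.g. CMS): the system of record pushes content when it is published.
        self._cache[key] = value


if __name__ == "__main__":
    service = DataService(lambda key: "forecast for " + key)
    service.get("30339")
    service.get("30339")
    print(service.origin_calls)  # the second call is served from cache
    service.publish("article/1", "published body")
    print(service.get("article/1"))
```

Either way the system of record only has to speak a standard protocol; ingest and distribution stay in the shared layers.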
23. Architecture: Data Services
- RESTful API Design
- Stateless
- Decoupled
- Atomic / Aggregation Services
- Support both Push/Pull Model
- API Key driven Auth/Metering
- Horizontally Scalable
- Capable of serving billions of
requests / day
- Data lends itself well to caching
24. Architecture: Distribution – Weather Data
Redis
Riak
OAPI API Gateway CDN API Users
FOD
Dispatcher
COD
Dispatcher
Aggregate
Engine
COD
Cache
FOD
Cache
Outbound API (OAPI)
- Fine grained RESTful API
- Intelligent Cache Management
- Accesses datastores, systems of record and
other services
Aggregate Engine
- Aggregates fine grained APIs
- Aggregates at Edge through CDN ESI
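A small sketch of server-side aggregation over fine-grained APIs is shown below. It only illustrates the first Aggregate Engine bullet; edge aggregation through CDN ESI is not shown. The aggregate function, the fetcher names, and the location code are hypothetical, and plain callables stand in for calls to the fine-grained OAPI endpoints.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def aggregate(location: str, fetchers: Dict[str, Callable[[str], object]]) -> Dict[str, object]:
    """Call each fine-grained fetcher concurrently and merge the results by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(fetchers))) as pool:
        futures = {name: pool.submit(fetch, location) for name, fetch in fetchers.items()}
        # result() re-raises any exception from the fetcher, so failures are not hidden.
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    fetchers = {
        "observations": lambda loc: {"location": loc, "tempF": 72},
        "alerts": lambda loc: [],
    }
    print(aggregate("KATL", fetchers))
```

Because each fine-grained response can be cached on its own TTL, the aggregate response costs roughly as much as its slowest uncached part.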
25. Architecture: Request Flow
AZ A
AZ B
Public Subnet
Public Subnet
Private
Subnet
Internet
Private
Subnet
OAPI
FOD Cache
COD Cache
FOD
COD
OAPI
26. Architecture: Distribution – Content (Articles, Images, Video)
Distribution Services
Drupal CMS
Metadata Store
Images
Videos
Asset
Metadata
Image Cut Service
Video Distribution
Services
Generic Asset
Service
mRSS Feeds
Metadata
Metadata
Static Asset Pools
S3
27. Architecture: Gateway
- Authentication
- Routing
- Metering
- Throttling
- CDN Aware, CDN Driven
- Remember 25ms latency target!
- We rolled our own
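The slides do not describe how the home-grown gateway implements throttling. One common approach is a token bucket per API key, sketched below together with simple authentication and metering. Everything here (TokenBucket, Gateway, the status codes, the limits format) is an assumption for illustration rather than the authors' design.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Allows bursts up to `burst` requests and refills at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_sec
        self._capacity = float(burst)
        self._tokens = float(burst)
        self._clock = clock
        self._updated = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self._tokens = min(self._capacity, self._tokens + (now - self._updated) * self._rate)
        self._updated = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


class Gateway:
    """Hypothetical gateway check: authenticate by API key, throttle, then meter."""

    def __init__(self, limits: Dict[str, Tuple[float, int]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._buckets = {key: TokenBucket(rate, burst, clock)
                         for key, (rate, burst) in limits.items()}
        self.metered = {key: 0 for key in limits}

    def check(self, api_key: str) -> int:
        bucket = self._buckets.get(api_key)
        if bucket is None:
            return 401  # unknown key (status code is an assumption)
        if not bucket.allow():
            return 429  # throttled (status code is an assumption)
        self.metered[api_key] += 1
        return 200
```

With a 25 ms latency target, a check like this has to run in memory close to the request path, which is consistent with the slide's emphasis on a CDN-aware gateway.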
28. Architecture: Gateway
API
Users
CDN
Authentication,
metering, Throttling
Quick Response
Caching routing
Origin routing
Source of
Authentication
Truth
- User makes API request
- CDN checks authorization (look-aside)
- If authorized, check cache
- If cache-miss, hit origin caching/routing
- If origin cache-miss, pass through to backend servers
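The request flow listed above can be sketched as a single function. This is a simplified model under stated assumptions: dicts stand in for the edge and origin caches, a set stands in for the authentication source of truth, the 403 status and the served_by labels are invented for illustration, and TTLs are omitted.

```python
from typing import Callable, Dict, Set, Tuple


def handle_request(
    api_key: str,
    path: str,
    authorized_keys: Set[str],
    edge_cache: Dict[str, str],
    origin_cache: Dict[str, str],
    backend: Callable[[str], str],
) -> Tuple[int, str, str]:
    """Return (status, body, served_by) following the steps on the slide."""
    # Look-aside authorization: the CDN consults the auth source before serving anything.
    if api_key not in authorized_keys:
        return 403, "", "cdn"
    # Authorized: check the CDN (edge) cache first.
    if path in edge_cache:
        return 200, edge_cache[path], "edge-cache"
    # Edge miss: try the origin caching/routing tier and fill the edge on the way back.
    if path in origin_cache:
        body = origin_cache[path]
        edge_cache[path] = body
        return 200, body, "origin-cache"
    # Origin cache miss: pass through to the backend servers and fill both caches.
    body = backend(path)
    origin_cache[path] = body
    edge_cache[path] = body
    return 200, body, "backend"


if __name__ == "__main__":
    edge: Dict[str, str] = {}
    origin: Dict[str, str] = {}
    keys = {"good-key"}
    print(handle_request("good-key", "/v1/obs", keys, edge, origin, lambda p: "data for " + p))
    print(handle_request("good-key", "/v1/obs", keys, edge, origin, lambda p: "data for " + p))
```

The second call is answered from the edge cache, so the backend sees only the first request for a given path.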
29. Architecture: The Other Side – Events & Analytics!
Data Lake
Operational
Analytics
Business
Analytics
Executive
Dashboards
Data
Discovery
Data
Science
3rd Party
System
Integration
Stream
Processing
Long Term Raw Storage
Short Term Storage and
Big Data Processing
Consumers
Amazon SQS
Streaming
Custom
Ingestion
Pipeline
Events
3rd Party
Other DBs
S3
Batch
Sources
Streaming
Sources
ETL
Data Access
SQL
30. Architecture: Putting it all together
Component layers (diagram): Data Processing Engine, Data Services, Storage, Systems of Record, Gateway, CDN
31. Architecture: Implementation
Global Region 2
Global
Region 3
Global
Region 4
Global Region 1
Global Traffic Management
and CDN
Remote
Ingestion
Remote
Ingestion
FOD FOD FOD
Global Region 2
Monitoring
Configuration Mgmt
Automation
Partner Data Sources:
(Weather, Alerts, Traffic, etc)
Distribution Engine Distribution Engine Distribution Engine
FOD
Distribution Engine
33. A Curve Ball!
Challenge:
• New deal struck with a
MAJOR mobile phone
company
• Ship new API
• Time to Market = 3 months
• Scale to 25+ billion
requests per day
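To put the daily figure in per-second terms, the short calculation below converts requests per day into an average request rate. It assumes traffic is spread evenly over the day, which real traffic is not, so peak rates would be higher than this average.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rps(requests_per_day: float) -> float:
    """Average requests per second, assuming an even spread over 24 hours."""
    return requests_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    # 25 billion requests/day from this slide.
    print(round(average_rps(25_000_000_000)))
```

The result is roughly 289,000 req/sec on average, in line with the 300,000 req/sec target quoted in the session abstract and about 300 times the original 1,000 req/sec.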
34. Some findings
Architecture Already Decoupled
- Focus on Scaling Distribution Layer
Findings in Cycle:
- Load Testing / Tuning
- VPC NAT Saturation
- DNS Servers Sizing
- Instance Types and Characteristics
- OS Kernel Limits
- Destructive Testing / Fixing
- Brought Down instances, AZs,
Regions
- Corrupted caches, databases
Load Test
Tune
Destructive
Test
Fix
35. KEY TAKEAWAY
It takes time to figure all this out … so
please budget time and resources for both
load and destructive testing
40. Which NoSQL?
+ Write performance
more critical than
durability
+ Native multi-X
replication
+ Ecosystem
– Repartitioning
– Operational burden
– Data transfer cost
+ “Zero downtime”
+ Cross-region
replication
– Repartitioning
– Operational burden
– Data transfer cost
+ Managed solution
+ Easy to scale
+ Constantly
Evolving
– Item size
– Cross-region
replication
Storage
DynamoDB
41. Stream Storage
Building a DPE – AWS Style
Decouple producers &
consumers
Temporary buffer
Preserve client ordering
Streaming MapReduce
Diagram: Producers 1 to N write keyed records (e.g. Key = Red, Key = Green), numbered 1 to 4 per key, to Shard 1 and Shard 2. Consumer 1 reports Count of Red = 4 and Count of Violet = 4; Consumer 2 reports Count of Blue = 4 and Count of Green = 4.
Data Processing Engine
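The partition-and-count pattern in the diagram above can be sketched in a few lines. This is a simplified model, not the Kinesis or Kafka API: Amazon Kinesis maps the MD5 hash of a partition key onto per-shard hash key ranges, whereas this sketch uses a modulo over the shard count, and the function names and in-memory lists are assumptions for illustration.

```python
import hashlib
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, int]  # (partition key, sequence number assigned by the producer)


def shard_for(partition_key: str, shard_count: int) -> int:
    """Pick a shard from the MD5 hash of the key (modulo is a simplification)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % shard_count


def route(records: Iterable[Record], shard_count: int) -> Dict[int, List[Record]]:
    """Append each record to its shard; records with the same key keep their order."""
    shards: Dict[int, List[Record]] = defaultdict(list)
    for key, sequence in records:
        shards[shard_for(key, shard_count)].append((key, sequence))
    return shards


def count_by_key(shard_records: Iterable[Record]) -> Counter:
    """A consumer's per-shard aggregation, as in the 'Count of Red = 4' labels."""
    return Counter(key for key, _ in shard_records)


if __name__ == "__main__":
    colors = ("Red", "Green", "Blue", "Violet")
    records = [(color, n) for n in range(1, 5) for color in colors]
    for shard_id, shard_records in sorted(route(records, 2).items()):
        print(shard_id, dict(count_by_key(shard_records)))
```

Because every record for a key lands on the same shard, each consumer can count its keys independently and per-key ordering is preserved without coordination between consumers.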
42. Which Stream Store Should I Use?
Amazon Kinesis and Apache Kafka have many similarities
• Multiple consumers
• Ordering of records
• Streaming MapReduce
• Low latency
• Highly durable, available, and scalable
Differences
• Record lifetime: 24 hours in Amazon Kinesis, configurable in Kafka
• Record size: 1MB/record in Amazon Kinesis, configurable in Kafka
• Amazon Kinesis is a fully managed service
• Easier to provision, manage, and scale
Data Processing Engine
43. Server-less Approach to DPE
Amazon Kinesis: Capture the stream
AWS Lambda: Process the stream
Data Input | Action | Data Output
IT application activity | Audit | SNS
Metering records | Condense | Redshift
Change logs | Backup | S3
IoT Device Data | Store | RDS
Transaction orders | Process | SQS
Server health metrics | Monitor | EC2
Data Processing Engine
46. Architectural Evolution: Technical Stack
Ingest
- Queue:
- Amazon
SQS
- Stream
- Kafka
- Micro DPE
- Avro
- Thrift
- Protocol Buffers
- Micro-Services
Type of Model For
Ingest
Distribution
- Micro Services
- Language Polyglot
- Service Discovery
Storage
- Amazon Aurora
- BYOS
Analytics
- Parquet +
Amazon S3
- Spark
- Amazon EMR
47. Wrapping Up!
- Have an Architectural
Blueprint
- Keep Decoupled or
Loosely Coupled Layers
- Communication via
Standard Protocols
- Keep Architectural Plan
“Technology Agnostic”
- Storage Polyglot
- Language Polyglot
- Be Aware of the
Monoliths!
- Keep Caching
Architecture Simple – TTL
Driven
- Always Budget for
- Load Testing
- Destructive Testing
48. Related Sessions
ARC309 - From Monolithic to Microservices: Evolving Architecture
Patterns in the Cloud - Thursday
ARC301 - Scaling Up to Your First 10 Million Users - Thursday
BDT310 - Big Data Architectural Patterns and Best Practices on
AWS – Today 2:45 PM
BDT403 - Best Practices for Building Real-time Streaming
Applications with Amazon Kinesis - Thursday