Aerospike In The Age Of Real-Time Big Data
1. IN-MEMORY NOSQL
FIRST FLASH OPTIMIZED IN-MEMORY
NOSQL DATABASE
NOW OPEN SOURCE!
KHOSROW AFROOZEH
ENGINEER
Aerospike aer·o·spike [air-oh-spahyk]
noun, 1. tip of a rocket that enhances speed and stability
© 2014 Aerospike, Inc. All rights reserved. Confidential. | Berlin Big Data Beers - October 29, 2014
2. Aerospike – Built for the Age of the Millions Of Customers
■ The Gold Standard
7 of top 16
powered by Aerospike
(after Google, FB, from
BuiltWith.com )
3. “We run Aerospike heavily, peaking at 3 Million reads per second and well over
1 1/2 million writes a second in a very cost effective way. I don’t think there’s
any technology we’ve run into that even comes close.”
– Geir Magnusson, CTO of AppNexus
4. Who Uses Aerospike?
INTELLIGENT & INSTANT INTERNET-SCALE INTERACTIONS
5. MARKET FORCES
6. REQUIREMENTS FOR INTERNET ENTERPRISES
1. Know who the Interaction is with
■ Monitor 200+ Million US Consumers,
5+ Billion mobile devices and sensors
2. Determine intent based on
current context
■ Page views, search terms, game state, last
purchase, friends list, ads served, location
3. Respond now, use big data for
more accurate decisions
■ Display the most relevant Ad
■ Recommend the best product
■ Deliver the richest gaming experience
■ Eliminate fraud…
4. 100% up-time!
7. DATABASE LANDSCAPE
STRUCTURED DATA
■ TRANSACTIONS (OLTP): Response time: Seconds; Gigabytes of data; Balanced Reads/Writes
■ ANALYTICS (OLAP): Response time: Seconds; Terabytes of data; Read Intensive
UNSTRUCTURED DATA
■ REAL-TIME BIG DATA (Real-time Transactions): Response time: < 10 ms; 1-20 TB; Balanced Reads/Writes; 24x7x365 Availability
■ BIG DATA ANALYTICS: Response time: Hours, Weeks; TB to PB; Read Intensive
8. Introduction to Advertising: Real-Time Bidding
9. North American RTB speeds & feeds
■ 1 to 6 billion cookies tracked
■ Some companies track 200M, some track 20B
■ Each bidder has their own data pool
■ Data is your weapon
■ Recent searches, behavior, IP addresses
■ Audience clusters (K-cluster, K-means) from offline Hadoop
■ “Remnant” from Google, Yahoo is about 0.6 million / sec
■ Facebook exchange: about 0.6 million / sec
■ “other” is 0.5 million / sec
Currently about 2.0M / sec in North America
10. Advertising requirements
■ 100 millisecond to 150 millisecond ad delivery
■ De-facto standard set in 2004 by Washington Post and others
■ North America is 70 to 90 milliseconds wide / Europe About
half of that
■ Two or Three data centers
■ Auction is limited to 30 milliseconds
■ Typically closes in 5 milliseconds
■ Winners have more data, better models – in 5 milliseconds
11. Advertising Technology Stack
MILLIONS OF CONSUMERS
BILLIONS OF DEVICES
APP SERVERS
DATA
INSIGHTS WAREHOUSE
WRITE CONTEXT
OPERATIONAL DB
WRITE REAL-TIME CONTEXT
READ RECENT CONTENT
PROFILE STORE
Cookies, email, deviceID, IP address, location,
segments, clicks, likes, tweets, search terms...
REAL-TIME ANALYTICS
Best sellers, top scores, trending tweets
BATCH ANALYTICS
Discover patterns,
segment data: location
patterns, audience
affinity
12. Financial Services – Intraday Positions
ACCOUNT
POSITIONS
Read/Write
Query
Start of Day
Data Loading
End of Day
Reconciliation
LEGACY DATABASE
(MAINFRAME)
REAL-TIME
DATA FEED
XDR
10M+ user records
Primary key access
1M+ TPS planned
Finance App
Records App
RT Reporting App
13. Travel Portal
Poll for Pricing Changes
PRICING DATA
Store Latest Price
PRICING DATABASE
(RATE LIMITED)
SESSION
MANAGEMENT
Session
Data Read
Price
XDR
Airlines forced interstate
banking
Legacy mainframe
technology
Multi-company reservation
and pricing
Requirement: 1M TPS
allowing overhead
Travel App
14. QOS & Real-Time Billing for Telcos
■ In-switch Per HTTP request Billing
■ US Telcos: 200M subscribers, 50 metros
Execute Request
■ In-memory use case
SOURCE
DEVICE/ USER
Hot Standby
Real-time
Checks
DESTINATION
Request
Update
Device
User
Settings
XDR
Real-time Auth. QoS Billing
Config Module App
15. Old Architecture ( mid 2000s )
CONTENT
DELIVERY NETWORK
Request routing and sharding
LOAD BALANCER
APP SERVERS
CACHE
DATABASE
STORAGE
16. Modern Scale Out Architecture
Load balancer
Simple stateless
APP SERVERS
CONTENT
DELIVERY NETWORK
IN-MEMORY NoSQL
RESEARCH
WAREHOUSE
LOAD BALANCER
Fast stateless Long term cold storage
17. Modern Scale Out Architecture
LOAD BALANCER
Simple stateless
APP SERVERS
CONTENT
DELIVERY NETWORK
Fast stateless Long term cold storage
IN-MEMORY NoSQL
RESEARCH
WAREHOUSE
HDFS BASED
18. ARCHITECTURE
19. Architecture – The Big Picture
1) No Hotspots – DHT simplifies data partitioning
2) Smart Client – 1 hop to data, no load balancers
3) Shared Nothing Architecture, every node identical
4) Single row ACID – synch replication in cluster
5) Smart Cluster, Zero Touch – auto-failover, rebalancing, rack aware, rolling upgrades...
6) Transactions and long running tasks prioritized real-time
7) XDR – async replication across data centers ensures Zero Downtime
8) Scale linearly as data-sizes and workloads increase
9) Add capacity with no service interruption
20. SHARED-NOTHING SYSTEM: 100% DATA AVAILABILITY
■ Every node in a cluster is identical,
handles both transactions and long
running tasks
■ Data is replicated synchronously with
immediate consistency within the
cluster
■ Data is replicated asynchronously
across data centers
OHIO Data Center
21. ROBUST DHT TO ELIMINATE HOT SPOTS
How Data Is Distributed (Replication Factor 2)
■ Every key is hashed into a
20 byte (fixed length) string
using the RIPEMD160 hash function
■ This hash + additional data
(fixed 64 bytes)
are stored in RAM in the index
■ Some bits from this hash value are used
to compute the partition id
■ There are 4096 partitions
■ Partition id maps to node id
based on cluster membership
Example key:  cookie-abcdefg-12345678
Example hash: 182023kh15hh3kahdjsh
Partition ID   Master node   Replica node
…              1             4
1820           2             3
1821           3             2
4096           4             1
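The key-to-node mapping described on this slide can be sketched in a few lines of Python. This is an illustration only, not Aerospike's implementation: the choice of which 12 bits to take from the digest, the round-robin placement in build_partition_map, and the SHA-1 fallback (used only when the local Python build lacks RIPEMD160) are all assumptions made for the example.

```python
import hashlib

N_PARTITIONS = 4096  # number of partitions stated on the slide


def key_digest(key: str) -> bytes:
    """Return a 20-byte digest of the key.

    The slide names RIPEMD160. Some Python/OpenSSL builds do not provide it,
    so this sketch falls back to SHA-1 (also 20 bytes) purely as a stand-in.
    """
    data = key.encode("utf-8")
    try:
        return hashlib.new("ripemd160", data).digest()
    except ValueError:
        return hashlib.sha1(data).digest()


def partition_id(digest: bytes) -> int:
    # Assumption: use the low 12 bits of the first two bytes (2**12 == 4096).
    # IDs in this sketch run from 0 to 4095.
    return int.from_bytes(digest[:2], "little") & (N_PARTITIONS - 1)


def build_partition_map(node_ids, replication_factor=2):
    # Hypothetical placement: rotate through the sorted node list so that the
    # master and its replica are always distinct nodes.
    nodes = sorted(node_ids)
    return [
        tuple(nodes[(pid + r) % len(nodes)] for r in range(replication_factor))
        for pid in range(N_PARTITIONS)
    ]


pmap = build_partition_map([1, 2, 3, 4])
pid = partition_id(key_digest("cookie-abcdefg-12345678"))
master, replica = pmap[pid]
```

Because the partition ID depends only on the key, every client computes the same ID; only the partition-to-node map changes when cluster membership changes.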
22. REAL-TIME PRIORITIZATION TO MEET SLA
Writing with Immediate Consistency (master, replica)
1. Write sent to row master
2. Latch against simultaneous writes
3. Apply write to master memory and replica memory synchronously
4. Queue operations to disk
5. Signal completed transaction (optional storage commit wait)
6. Master applies conflict resolution policy (rollback/rollforward)
Adding a Node (transactions continue)
1. Cluster discovers new node via gossip protocol
2. Paxos vote determines new data organization
3. Partition migrations scheduled
4. When a partition migration starts, write journal starts on destination
5. Partition moves atomically
6. Journal is applied and source data deleted
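Steps 1 to 5 of the write path can be sketched as a toy model. The class names (Node, RowMaster) and the per-key lock standing in for the latch are hypothetical; the conflict resolution policy in step 6 and the optional storage commit wait are not modeled.

```python
import threading
from collections import deque


class Node:
    def __init__(self, name):
        self.name = name
        self.memory = {}            # in-memory copy of the records
        self.disk_queue = deque()   # operations waiting to be written to storage

    def apply(self, key, value):
        self.memory[key] = value                 # step 3: apply in memory
        self.disk_queue.append((key, value))     # step 4: queue operation to disk


class RowMaster:
    """Coordinates a write on the master node and its replica."""

    def __init__(self, master, replica):
        self.master = master
        self.replica = replica
        self._latches = {}
        self._guard = threading.Lock()

    def _latch_for(self, key):
        # One lock per key, created on first use.
        with self._guard:
            return self._latches.setdefault(key, threading.Lock())

    def write(self, key, value):
        with self._latch_for(key):           # step 2: latch against simultaneous writes
            self.master.apply(key, value)    # step 3: master memory ...
            self.replica.apply(key, value)   # ... and replica memory, synchronously
        return "ok"                          # step 5: signal completed transaction


master, replica = Node("A"), Node("B")
status = RowMaster(master, replica).write("cookie-abcdefg-12345678", {"segment": 7})
```

The point of the sketch is the ordering: the caller is only told the write completed after both in-memory copies have been updated, while the disk write is queued rather than awaited.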
23. INTELLIGENT CLIENT TO MAKE APPS SIMPLER
Shield Applications from the Complexity of the Cluster
■ Implements Aerospike API
■ Optimistic row locking
■ Optimized binary protocol
■ Cluster tracking
■ Learns about cluster changes,
partition map
■ Transaction semantics
■ Global transaction ID
■ Retransmit and timeout
■ Linear scale
■ No extra hop
■ No load balancers
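A minimal sketch of the cluster-tracking and retransmit ideas, assuming a toy cluster in which every node holds every record (so a failover can be shown without modeling replication). ClusterNode and SmartClient are hypothetical names, and CRC32 stands in for the real key hash.

```python
import zlib


class ClusterNode:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}

    def get(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data.get(key)


class SmartClient:
    def __init__(self, nodes, n_partitions=4096):
        self.nodes = nodes
        self.n_partitions = n_partitions
        self.refresh_partition_map()

    def refresh_partition_map(self):
        # Cluster tracking: rebuild the partition map from the live nodes.
        live = [n for n in self.nodes if n.alive]
        if not live:
            raise ConnectionError("no live nodes")
        self.partition_map = [live[p % len(live)] for p in range(self.n_partitions)]

    def owner(self, key):
        pid = zlib.crc32(key.encode("utf-8")) % self.n_partitions
        return self.partition_map[pid]

    def get(self, key, max_retries=2):
        # One hop: talk directly to the node that owns the key's partition.
        for _ in range(max_retries + 1):
            try:
                return self.owner(key).get(key)
            except ConnectionError:
                self.refresh_partition_map()   # learn the change, then retransmit
        raise TimeoutError(f"gave up on {key!r}")


nodes = [ClusterNode(n) for n in "abc"]
for n in nodes:
    n.data["k"] = "v"          # toy assumption: fully replicated data
client = SmartClient(nodes)
value = client.get("k")
```

Since the client holds the partition map itself, no load balancer or proxy sits between the application and the node that owns the data.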
24. FLASH OPTIMIZED HIGH PERFORMANCE
• Direct Device Access
• Large Block Writes
• Indexes in DRAM
• Highly Parallelized
• Log-structured FS “copy-on-write”
• Fast restart with shared memory
AEROSPIKE HYBRID MEMORY SYSTEM™
AEROSPIKE FLASH OPTIMIZED IN-MEMORY DATABASE: OPEN NVM, BLOCK INTERFACE, SSD / SSD / SSD
“Ask me and I’ll tell you the answer.”
OTHER DATABASE: OS FILE SYSTEM, PAGE CACHE, BLOCK INTERFACE, SSD / HDD
“Ask me. I’ll look up the answer and then I’ll let you know.”
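The combination of "Indexes in DRAM" and a log-structured, copy-on-write layout can be illustrated with a toy store. This is a sketch under simplifying assumptions: a bytearray stands in for the raw flash device, and large-block write buffering, defragmentation, and parallelism are not modeled. LogStructuredStore is a hypothetical name.

```python
class LogStructuredStore:
    def __init__(self):
        self.log = bytearray()   # stands in for the raw device
        self.index = {}          # key -> (offset, length), kept in memory

    def put(self, key, value: bytes):
        offset = len(self.log)
        self.log += value                         # append; never overwrite in place
        self.index[key] = (offset, len(value))    # index now points at the new copy

    def get(self, key) -> bytes:
        offset, length = self.index[key]          # one in-memory lookup ...
        return bytes(self.log[offset:offset + length])   # ... then one device read

    def live_bytes(self) -> int:
        # Bytes still referenced by the index; the rest of the log is garbage
        # that a real system would reclaim in the background.
        return sum(length for _, length in self.index.values())


store = LogStructuredStore()
store.put("user:1", b"v1")
store.put("user:1", b"version-2")
latest = store.get("user:1")
```

A read never searches the device: the in-memory index gives the exact location, which is the contrast the two quotes on the slide are making.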
25. FLASH PROVIDES DRAM-LIKE PERFORMANCE WITH
MUCH LOWER COMPLEXITY & TCO
Actual customer analysis.
Customer requires 500K TPS, 10 TB of storage, with 2x replication factor.
Storage type                               DRAM & NoSQL (OTHER DATABASES ONLY)   SSD & DRAM
Storage per server                         180 GB (196 GB Server)                2.4 TB (4 x 700 GB)
TPS per server                             500,000                               500,000
Cost per server                            $8,000                                $11,000
Servers required                           186                                   14
Server costs                               $1,488,000                            $154,000
Power/server                               0.9 kW                                1.1 kW
Power (2 years), $0.12 per kWh ave. US     $352,000                              $32,400
Maintenance (2 years), $3,600 per server   $670,000                              $50,400
Total                                      $2,510,000                            $236,800
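The totals on this slide can be reproduced from the per-server figures. The sketch below assumes that "2 years" means 2 x 365 x 24 hours of continuous operation and that the $3,600 maintenance figure covers the whole two-year period; with those assumptions the results match the slide to within rounding. The function name tco is hypothetical.

```python
HOURS_IN_2_YEARS = 2 * 365 * 24   # assumption: servers run continuously


def tco(servers, cost_per_server, kw_per_server,
        price_per_kwh=0.12, maintenance_per_server=3600):
    server_costs = servers * cost_per_server
    power = servers * kw_per_server * HOURS_IN_2_YEARS * price_per_kwh
    maintenance = servers * maintenance_per_server
    return {
        "server_costs": server_costs,
        "power": power,
        "maintenance": maintenance,
        "total": server_costs + power + maintenance,
    }


dram_only = tco(servers=186, cost_per_server=8_000, kw_per_server=0.9)
ssd_dram = tco(servers=14, cost_per_server=11_000, kw_per_server=1.1)
ratio = dram_only["total"] / ssd_dram["total"]   # roughly a tenfold difference
```

The server count dominates every line item, so the cost difference follows almost entirely from needing 14 servers instead of 186.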
26. High Availability Through Clustering & Replication
1 2 3 4 5 Phases
1) 100KTPS – 4 nodes
2) Clients at Max
3) 400KTPS – 4 nodes
4) 400KTPS – 3 nodes
5) 400KTPS – 4 nodes
Aerospike Node Specs:
CentOS 6.3
Intel i5-2400@ 3.1 GHz (Quad core)
16 GB RAM@1333 MHz
27. HOT ANALYTICS
28. Key Value Store + Lists, Maps
■ Namespaces (policy containers)
■ Determine storage - DRAM or Flash
■ Determine replication factor
■ Contain records and sets
■ Sets (tables) of records
■ Arbitrary grouping
■ Records (rows) of key/bins
■ Block size (128KiB – 2MiB)
■ Bin with same name can contain values
of different types
■ String, integer, bytes (raw, blob, etc)
■ List ( an ordered collection of values )
■ Map ( a collection of keys and values )
■ Bins can be added anytime
■ Meta data
■ Generation counter so apps can ensure that a
record was not modified since last read
■ Time-To-Live value for auto expiration, keeping
most recent context or "hot" data, aging out
historical context
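The two pieces of record metadata, the generation counter and the time-to-live, can be sketched with a toy in-memory namespace. This is not the Aerospike client API: ToyNamespace, GenerationError, and the injectable clock are hypothetical names used only to show the check-and-set and expiry behaviour.

```python
import time


class GenerationError(Exception):
    """Raised when the record changed since the caller last read it."""


class ToyNamespace:
    def __init__(self, clock=time.monotonic):
        self._records = {}
        self._clock = clock

    def get(self, key):
        rec = self._records.get(key)
        if rec is None or rec["expires_at"] <= self._clock():
            self._records.pop(key, None)      # expired records age out
            return None
        return dict(rec["bins"]), rec["generation"]

    def put(self, key, bins, ttl, expected_generation=None):
        current = self.get(key)
        generation = current[1] if current else 0
        if expected_generation is not None and expected_generation != generation:
            raise GenerationError(f"{key!r} is at generation {generation}")
        merged = current[0] if current else {}
        merged.update(bins)                   # bins can be added at any time
        self._records[key] = {
            "bins": merged,
            "generation": generation + 1,     # bumped on every write
            "expires_at": self._clock() + ttl,
        }
        return generation + 1


ns = ToyNamespace()
ns.put("user:1", {"segment": "sports"}, ttl=3600)
bins, gen = ns.get("user:1")
ns.put("user:1", {"clicks": 1}, ttl=3600, expected_generation=gen)
```

A writer that passes the generation it last read will fail instead of silently overwriting a concurrent update, and every write refreshes the TTL so that only recently touched ("hot") records stay in the store.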
29. KVS + Lists, Maps + Queries + UDFs
STREAM AGGREGATIONS
(INDEXED MAP-REDUCE)
Pipe Query results
through UDFs
■ Filter, Transform,
Aggregate: Map,
Reduce
■ Enforce security
■ UDFs in Lua to
■ CRUD on record
■ Calculation
based on data
within a record
■ Iterate through a
set / namespace
of records
■ UDFs for real-time
analytics and
aggregations
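The filter, map, and reduce stages of a stream aggregation can be sketched as a lazy pipeline. The slide states that UDFs are written in Lua; Python is used here only to illustrate the shape of the pipeline, and stream_aggregate and the sample records are hypothetical.

```python
from functools import reduce


def stream_aggregate(records, keep, transform, combine, initial):
    """Pipe a stream of records through filter -> map -> reduce."""
    filtered = (r for r in records if keep(r))      # filter stage
    mapped = (transform(r) for r in filtered)       # map stage
    return reduce(combine, mapped, initial)         # reduce stage


query_results = [
    {"user": "a", "region": "eu", "score": 10},
    {"user": "b", "region": "us", "score": 50},
    {"user": "c", "region": "eu", "score": 30},
]

top_eu_score = stream_aggregate(
    query_results,
    keep=lambda r: r["region"] == "eu",
    transform=lambda r: r["score"],
    combine=max,
    initial=0,
)
```

Because each stage is a generator, records flow through one at a time and no intermediate list is built, which is the property that makes this style suitable for aggregating query results as they are read.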
30. LOW SELECTIVITY INDEX QUERIES
1. Query sent to ALL nodes in parallel (“SCATTER”)
2. Secondary Index keys in DRAM
■ Map to Primary keys in DRAM
■ Co-located with Record on SSD
3. Records read in parallel from ALL SSDs
4. Parallel read results aggregated on node
5. Results from ALL nodes aggregated client-side (“GATHER”)
[Diagram: Client and Servers; on each server, Secondary Keys and Primary Keys in DRAM, Records on SSD]
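The scatter and gather steps can be sketched with a thread pool standing in for the parallel requests to each node. QueryNode and scatter_gather are hypothetical names; each toy node keeps a secondary index over its own records only, mirroring the co-location described above.

```python
from concurrent.futures import ThreadPoolExecutor


class QueryNode:
    def __init__(self, records, indexed_bin):
        self.records = records    # primary key -> record (the "SSD" part)
        self.secondary = {}       # indexed value -> primary keys (the "DRAM" part)
        for pk, rec in records.items():
            self.secondary.setdefault(rec[indexed_bin], []).append(pk)

    def query(self, value):
        # Secondary key -> primary keys -> records, all local to this node.
        return [self.records[pk] for pk in self.secondary.get(value, [])]


def scatter_gather(nodes, value):
    # Scatter: send the query to every node in parallel.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda n: n.query(value), nodes))
    # Gather: merge the per-node results on the client side.
    return [rec for part in partials for rec in part]


node1 = QueryNode({"p1": {"name": "R1", "group_id": 1234},
                   "p2": {"name": "R2", "group_id": 7}}, "group_id")
node2 = QueryNode({"p3": {"name": "R3", "group_id": 1234}}, "group_id")
matches = scatter_gather([node1, node2], 1234)
```

Every node is asked because the secondary index is partitioned along with the data, so any node may hold matching records.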
31. SQL & NoSQL
➤ Secondary index
▪ Equality, Range, Compound
▪ e.g. WHERE group_id = 1234,
WHERE last_activity > 1349293398, WHERE
branch_id BETWEEN 19812 AND 1987139
➤ Filters
▪ SQL: Where clause with non-indexed “AND”s
(e.g. “AND gender=‘M’ ”)
▪ NOSQL: Map step
➤ Aggregation
▪ SQL: GROUP BY, ORDER BY, LIMIT,
OFFSET
▪ NOSQL: Reduce step
[Diagram: query flow between Client and Server: Secondary Key and Primary Key in DRAM, Record on SSD, then Filter, Map, and Aggregate on the server, with Reduce and Aggregate on the client]
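The correspondence between the SQL clauses and the NoSQL steps can be made concrete with a small example. The sketch assumes a sorted in-memory list as the secondary index (searched with bisect) and a count per group as the aggregation; RangeIndex, run_query, and the sample records are hypothetical.

```python
import bisect
from collections import Counter


class RangeIndex:
    """Toy secondary index supporting equality and range lookups."""

    def __init__(self, records, bin_name):
        self._entries = sorted((rec[bin_name], pk) for pk, rec in records.items())
        self._values = [value for value, _ in self._entries]

    def between(self, low, high):
        lo = bisect.bisect_left(self._values, low)
        hi = bisect.bisect_right(self._values, high)
        return [pk for _, pk in self._entries[lo:hi]]


def run_query(records, index, low, high, keep, group_key):
    candidates = (records[pk] for pk in index.between(low, high))  # indexed WHERE
    filtered = (r for r in candidates if keep(r))   # non-indexed AND / map step
    return Counter(group_key(r) for r in filtered)  # GROUP BY count / reduce step


records = {
    "u1": {"branch_id": 19812, "gender": "M", "group_id": 1234},
    "u2": {"branch_id": 50000, "gender": "F", "group_id": 1234},
    "u3": {"branch_id": 1987139, "gender": "M", "group_id": 1234},
    "u4": {"branch_id": 1987140, "gender": "M", "group_id": 99},
    "u5": {"branch_id": 100, "gender": "M", "group_id": 99},
    "u6": {"branch_id": 70000, "gender": "M", "group_id": 99},
}
index = RangeIndex(records, "branch_id")

# WHERE branch_id BETWEEN 19812 AND 1987139 AND gender = 'M' GROUP BY group_id
counts = run_query(records, index, 19812, 1987139,
                   keep=lambda r: r["gender"] == "M",
                   group_key=lambda r: r["group_id"])
```

Only the indexed predicate narrows the set of records that are read; the remaining predicates and the aggregation run over that narrowed stream.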
32. Operational + Analytics + Adding servers and Re-balancing
■ 300k TPS Operations +
Process 1 Million records
■ Runs in 0.5 seconds
■ Add 2 servers,
auto-rebalance
while running query
33. Performance
34. Native Flash ! Performance
HIGH THROUGHPUT
[Chart: Throughput, TPS, for Balanced and Read-Heavy workloads; Aerospike, Cassandra, MongoDB, Couchbase 2.0*]
LOW LATENCY
[Chart: Balanced Workload Read Latency; Average Latency, ms vs. Throughput, ops/sec; Aerospike, Cassandra, MongoDB]
[Chart: Balanced Workload Update Latency; Average Latency, ms vs. Throughput, ops/sec; Aerospike, Cassandra, MongoDB]
* “We were forced to exclude Couchbase… since when run with either disk or replica durability on, it was unable to complete the test.” – Thumbtack Technology
35. Updated YCSB Benchmark
■ What’s different?
■ Aerospike 3.2.8 instead of Aerospike 2
■ Stock irqbalance is smarter
■ “After Burner” script maps threads/cores to
cpu sockets, no copies across NUMA
nodes
■ Minimized context switching, branch
instructions
■ 10G network instead of 1 G
36. 2014: 1 M TPS on Single Server
37. Hot Analytics
■ High throughput Queries
■ 2 node cluster, 10 Indexes
■ Query returns 100 of 50M
records
■ Predictable low latency
UN-PREDICTABLE LATENCY
128 – 300 ms
70 – 760 ms
7 – 10 ms
QPS
38. Amazon EC2 results
39. Amazon EC2 results
40. Amazon EC2 results
41. LESSONS LEARNED
42. LESSONS LEARNED
1. Keep architecture simple
■ No hot spots (e.g., robust DHT)
■ Scales out easily (e.g., easy to size)
■ Avoids points of failure (e.g., single node type)
2. Avoid manual operation – automate, automate!
■ Self-managed cluster responds to node failures
■ Data rebalancing requires no intervention
■ Real-time prioritization allows unattended system operation
3. Keep system asynchronous
■ Shared nothing – nodes are autonomous
■ Async writes across data centers
■ Independent tuning parameters for different classes of tasks
43. LESSONS LEARNED (cont’d)
4. Monitor the Health of the System Extensively
■ Growth in load sneaks up on you over weeks
■ Early detection means better service
■ Most failures can be predicted (e.g., capacity, load, …)
5. Size clusters properly
■ Have enough capacity ALWAYS!
■ Upgrade SSDs every couple years
■ Reduce cluster sizes to make operations simple
6. Have geographically distributed data centers
■ Size the distributed data centers properly
■ Use active-active configurations if possible
■ Size bandwidth requirements accurately
44. LESSONS LEARNED (cont’d)
7. Have plan for unforeseen situations
■ Devise scenarios and practice during normal work time
■ Ensure you can do rolling upgrades during high load time
■ Make sure that your nodes can restart fast (< 1 minute)
8. Constantly test and monitor app end-to-end
■ Application level metrics are more important than DB metrics
■ Most issues in a service are due to a combination of application, network,
database, storage, etc.
9. Separate online and offline workloads
■ Reserve real-time edge database for transactions and hot analytics queries (where
newest data is important)
■ Avoid ad-hoc queries on on-line system
■ Perform deep analysis in offline system (Hadoop)
10. Use the Right Data Management System for the job
■ Fast NoSQL DB for real-time transactions and hot analytics on rapidly changing
data
■ Hadoop or other comparable systems for exhaustive analytics on mostly read-only
data
45. MODERN REAL-TIME DATA PLATFORM
Written in ‘C’,
Patents pending
1. Scaling the Internet of Everything
2. Pushing the limits of modern hardware
3. No data loss (ACID) and No downtime
AEROSPIKE SERVER APP SERVER
REAL-TIME BIG DATA APPLICATION
AEROSPIKE SMART CLIENT™
• APIs (C, C#, Java, Go, PHP, Python, Ruby, Node, Erlang…)
• Transactions, Cluster awareness
EXTENSIBLE DATA MODEL
• Str, Int, Lists, Maps
• Lookups, Queries, Scans
• Aerospike Alchemy
Framework™
with User Defined Functions
and Distributed Aggregations
MONITORING &
MANAGEMENT
• Aerospike Monitoring
Console™
• Command Line Tools
• Plugins-Nagios, Graphite,
Zabbix
AEROSPIKE SMART CLUSTER™
AEROSPIKE HYBRID
MEMORY SYSTEM™
PROXIMITY & REDUNDANCY
Cross Data Center Replication™
(XDR)
REAL-TIME
ENGINE
APP/WEB
SERVER
AEROSPIKE
CLUSTER
46. Server, Storage, Cloud benchmarks, partnerships
47. SUMMARY
Rapid Development
➤ Support for popular languages and tools
▪ AQL and Aerospike Client in C, Java, C#, Go, Ruby, Python, …
➤ Complex data types
▪ Nested documents (map, list, string, integer)
▪ Large (Stack, Set, List) Objects
➤ Queries
▪ Single Record
▪ Batch multi-record lookups
▪ Equality and Range
▪ Aggregations and Map-Reduce
Complete Customizability
➤ User Defined Functions
▪ In-DB processing
➤ Aggregation Framework
▪ UDF Pipeline
▪ MapReduce
➤ Time Series Queries
▪ Just 2 IOPs for most r/w (independent of object size)
49. We are hiring hardcore system developers.
Join us!
@khaf
khosrow@aerospike.com