© 2014 Aerospike. All rights reserved. Confidential 1
What Enterprises Can Learn
from Real-time Bidding
How, and why, to achieve
Operational Big Data
Brian Bulkowski
CTO and co-founder
Aerospike
© 2014 Aerospike. All rights reserved. Confidential 2
REQUIREMENTS FOR INTERNET ENTERPRISES
© 2014 Aerospike. All rights reserved. Confidential 3
Introduction to Advertising: Real-time Bidding
© 2014 Aerospike. All rights reserved. Confidential 4
North American RTB speeds & feeds
■  1 to 6 billion cookies tracked
■  Some companies track 200M, some track 20B
■  Each bidder has their own data pool
■  Data is your weapon
■  Recent searches, behavior, IP addresses
■  Audience clusters (K-cluster, K-means) from offline Hadoop
■  “Remnant” from Google, Yahoo is about 0.6 million / sec
■  Facebook exchange: about 0.6 million / sec
■  “other” is 0.5 million / sec
Currently about 3.0M / sec in North American
© 2014 Aerospike. All rights reserved. Confidential 5
Advertising requirements
■  100 millisecond or 150 millisecond ad delivery
■  De-facto standard set in 2004 by Washington Post and others
■  North America is 70 to 90 milliseconds wide
■  Two or three data centers
■  Auction is limited to 30 milliseconds
■  Typically closes in 5 milliseconds
■  Winners have more data, better models – in 5 milliseconds
© 2014 Aerospike. All rights reserved. Confidential 6
MILLIONS OF CONSUMERS
BILLIONS OF DEVICES
APP SERVERS
DATA
WAREHOUSEINSIGHTS
Advertising Technology Stack
WRITE CONTEXT
OPERATIONAL DB
WRITE REAL-TIME CONTEXT
READ RECENT CONTENT
PROFILE STORE
Cookies, email, deviceID, IP address, location,
segments, clicks, likes, tweets, search terms...
REAL-TIME ANALYTICS
Best sellers, top scores, trending tweets
BATCH ANALYTICS
Discover patterns,
segment data:
location patterns,
audience affinity
© 2014 Aerospike. All rights reserved. Confidential 7
Financial Services – Intraday Positions
LEGACY DATABASE
(MAINFRAME)
Read/Write
Start of Day
Data Loading
End of Day
Reconciliation
Query
REAL-TIME
DATA FEED
ACCOUNT
POSITIONS
XDR
10M+ user records
Primary key access
1M+ TPS planned
Finance App
Records App
RT Reporting App
© 2014 Aerospike. All rights reserved. Confidential 8
Social Media
MYSQL or POSTGRES
(ROTATIONAL DISK)
Recent user
generated content
Java application tier
Data abstraction
and sharding
MODIFIED REDIS
(SSD ENABLED)
Content and
Historical data
© 2014 Aerospike. All rights reserved. Confidential 9
Travel Portal
PRICING DATABASE
(RATE LIMITED)
Poll for
Pricing
Changes
PRICING
DATA
Store
Latest
Price
SESSION
MANAGEMENT
Session
Data
Read
Price
XDR
Airlines forced interstate
banking
Legacy mainframe
technology
Multi-company
reservation and pricing
Requirement: 1M TPS
allowing overhead
Travel App
© 2014 Aerospike. All rights reserved. Confidential 10
SOURCE
DEVICE/ USER
QOS & Real-Time Billing for Telcos
■  In-switch Per HTTP request Billing
■ US Telcos: 200M subscribers, 50 metros
■  In-memory use case
Hot Standby
Execute Request
Real-time
Checks
DESTINATION
Update
Device
User
Settings
Request
XDR
Real-time Auth. QoS Billing
Config Module App
© 2014 Aerospike. All rights reserved. Confidential 11
MILLIONS OF CONSUMERS
BILLIONS OF DEVICES
APP SERVERS
BATCH
ANALYTICSINSIGHTS
BATCH ANALYTICS
The New Architecture
WRITE CONTEXT
TRANSACTIONS &
HOT ANALYTICS
© 2014 Aerospike. All rights reserved. Confidential 12
Old Architecture ( scale out in 2000 )
Request routing and sharding
APP SERVERS
CACHE
DATABASE
STORAGE
CONTENT
DELIVERY NETWORK
LOAD BALANCER
© 2014 Aerospike. All rights reserved. Confidential 13
Modern Scale Out Architecture
Load balancer
Simple stateless
APP SERVERS
IN-MEMORY NoSQL
RESEARCH
WAREHOUSE
CONTENT
DELIVERY NETWORK
LOAD BALANCER
Long term cold storageFast stateless
© 2014 Aerospike. All rights reserved. Confidential 14
Modern Scale Out Architecture
Load balancer
Simple stateless
APP SERVERS
IN-MEMORY NoSQL
RESEARCH
WAREHOUSE
CONTENT
DELIVERY NETWORK
LOAD BALANCER
Long term cold storageFast stateless
HDFS BASED
© 2014 Aerospike. All rights reserved. Confidential 15
Build a data layer
Use open source
Focus on Key Value
Use In-memory NoSQL
Use Flash
© 2014 Aerospike. All rights reserved. Confidential 16
How Fast You Can Go
And
How To Do It
© 2014 Aerospike. All rights reserved. Confidential 17
SHARED-NOTHING SYSTEM:100% DATA
AVAILABILITY
■  Every node in a cluster is identical,
handles both transactions and long
running tasks
■  Data is replicated synchronously with
immediate consistency within the cluster
■  Data is replicated asynchronously
across data centers
OHIO Data Center
© 2014 Aerospike. All rights reserved. Confidential 18
LESSONS
1.  Optimize key-value code paths
■  No hot spots (e.g., robust DHT)
■  Scales up easily (e.g., easy to size)
■  Avoids points of failure (e.g., single node type)
■  Binary protocol
2.  Code in C
■  Read() / Write() / Linux AIO ( don’t trust a library )
■  Multithread !
■  Direct device
■  C++ could work, leads to structural complexity
3.  Memory allocation matters
■  Stack based allocators
■  Own stack allocator
■  JEMalloc for pools
© 2014 Aerospike. All rights reserved. Confidential 19
LESSONS (cont’d)
4.  Innovation: masters in a shared nothing system
■  Fast cluster organization
■  Fast transaction capabilities
■  Can be CP or AP - and resolve data accurately
5.  Clients are hard
■  Fast stable connection pools are hard
■  API design matters
■  Slow languages need Aerospike more
6.  Aggregations / queries required
■  Row oriented
■  Secondary indexes as filters, and MR style
■  Data transformation hurts
© 2014 Aerospike. All rights reserved. Confidential 20
LESSONS (cont’d)
7.  Network interrupts are painful
■  TCP is still better than UDP
■  Some great hardware out there (solarflare)
■  Network queues, RRD
■  New PCI-e, other interfaces are not there yet
■  Larger clusters solve interface issues
8.  “Large data types” (better documents)
■  Ordered List, Map, Set, Stack
■  Time series and documents
■  Beyond per-row storage layout, more optimal than document
9.  UDFs for flexibility
© 2014 Aerospike. All rights reserved. Confidential 21
WRITING RELIABILY WITH HIGH PERFORMANCE
1.  Write sent to row master
2.  Latch against simultaneous writes
3.  Apply write to master memory and replica
memory synchronously
4.  Queue operations to disk
5.  Signal completed transaction (optional
storage commit wait)
6.  Master applies conflict resolution policy
(rollback/ rollforward)
master replica
1.  Cluster discovers new node via gossip
protocol
2.  Paxos vote determines new data
organization
3.  Partition migrations scheduled
4.  When a partition migration starts, write
journal starts on destination
5.  Partition moves atomically
6.  Journal is applied and source data deleted
transactions
continue
Writing with Immediate Consistency Adding a Node
© 2014 Aerospike. All rights reserved. Confidential 22
Key Value Store + Lists, Maps
■  Namespaces (policy containers)
■  Determine storage - DRAM or Flash
■  Determine replication factor
■  Contain records and sets
■  Sets (tables) of records
■  Arbitrary grouping
■  Records (rows) of key/bins
■  Block size (128k – 2MB)
■  Bin with same name can contain values
of different types
■  String, integer, bytes (raw, blob, etc)
■  list ( an ordered collection of values )
■  map ( a collection of keys and values )
■  Bins can be added anytime
■  Meta data
■  Generation counter so apps can ensure that a
record was not modified since last read
■  Time-to-live value for auto expiration, keeping most
recent context or "hot" data, aging out historical
context
© 2014 Aerospike. All rights reserved. Confidential 23
KVS + Lists, Maps + Queries + UDFs
STREAM
AGGREGATIONS
(INDEXED MAP-REDUCE)
Pipe Query results
through UDFs
■  Filter, Transform,
Aggregate.. Map,
Reduce
■  Enforce security
■  UDFs in Lua to
■ CRUD a record
■ Calculation based
on data within a
record
■ Iterate through a
set / namespace of
records
■  UDFs for real-time
analytics and
aggregations
© 2014 Aerospike. All rights reserved. Confidential 24
SQL & NoSQL
➤ Secondary index
§  Equality, Range, IN (,,,), Compound
§  e.g. WHERE group_id = 1234,
WHERE last_activity > 1349293398, WHERE
branch_id IN (5,6,7,8)
➤ Filters
§  SQL: Where clause with non-indexed “AND”s
(e.g. “AND gender=‘M’ ”)
§  NOSQL: Map step
➤ Aggregation
§  SQL: GROUP BY, ORDER BY, LIMIT, OFFSET
§  NOSQL: Reduce step
Secondary Key
Primary Key
Record
Filter Map
Aggregate
DRAM
SSD
Aggregate
Client
Client
Server
Reduce
Aggregate
Query
© 2014 Aerospike. All rights reserved. Confidential 25
Flash storage
The Power of Flash Storage
© 2014 Aerospike. All rights reserved. Confidential 26
DATABASE
OS FILE SYSTEM
PAGE CACHE
BLOCK INTERFACE
SSD HDD
BLOCK INTERFACE
SSD SSD
OPEN NVM
SSD
Ask me and I’ll tell you the answer.Ask me. I’ll look up the answer and then tell it to
you.
DATABASE
HYBRID MEMORY SYSTEM™
•  Direct device access
•  Large Block Writes
•  Indexes in DRAM
•  Highly Parallelized
•  Log-structured FS “copy-on-write”
•  Fast restart with shared memory
FLASH OPTIMIZED HIGH
PERFORMANCE
© 2014 Aerospike. All rights reserved. Confidential 27© 2012 Aerospike. All rights reserved. Pg. 27
Measure your drives!
Aerospike Certification Tool (ACT)
http://github.com/aerospike/act
Transactional database workload
Reads: 1.5KB
(can’t batch / cache reads, random)
Writes: 128K blocks
(log based layout)
(plus defragmentation)
Turn up the load until
latency is over required SLA
© 2014 Aerospike. All rights reserved. Confidential 28
Aerospike’s Flash Experience
■  Know your Flash
■ ACT benchmark http://github.com/aerospike/act
■ Read-write benchmark results back to 2011
■  All clouds support flash now
■ New EC2 instances
■ Google Compute
■ Internap, Softlayer, GoGrid…
■  Write durability usually not a problem with modern flash
■ Durability is high (5 “drive writes per day” for 5 years, etc)
■ Read performance suffers under write load anyway
© 2014 Aerospike. All rights reserved. Confidential 29
Aerospike’s Flash Experience
■  Densities increasing
■ 100G 2 years ago à 800G today
■ SATA vs PCI-E
■ Appliances: 50T per 1U this year
■  Prices still dropping: perhaps $1/G next year
■  Intel P3700 results
■ 250K per device @ $2.5 / G
■ Old standard: Micron P320h 500K @ $8 / G
■  “Wide SATA”
■ 20 SATA drives
■ LSI “pass through mode”
■ 250K+ per server
© 2014 Aerospike. All rights reserved. Confidential 30
Use Open Source
© 2014 Aerospike. All rights reserved. Confidential 31
Aerospike: the trusted In-Memory NoSQL
Performance
• Over ten trillion transactions per month
• 99% of transactions < 2 ms
• 150K TPS per server
Scalability
• Billions of Internet users
• Clustered Software
• Maintenance without downtime
• Scale up & scale out
Reliability
• 50 customers; zero down-time
• Immediate Consistency
• Rapid Failover; Data Center Replication
Price/Performance
• Makes impossible projects affordable
• Flash-optimized
• 1/10 the servers required
© 2014 Aerospike. All rights reserved. Confidential 32

What enterprises can learn from Real Time Bidding (RTB)

  • 1.
    © 2014 Aerospike.All rights reserved. Confidential 1 What Enterprises Can Learn from Real-time Bidding How, and why, to achieve Operational Big Data Brian Bulkowski CTO and co-founder Aerospike
  • 2.
    © 2014 Aerospike.All rights reserved. Confidential 2 REQUIREMENTS FOR INTERNET ENTERPRISES
  • 3.
    © 2014 Aerospike.All rights reserved. Confidential 3 Introduction to Advertising: Real-time Bidding
  • 4.
    © 2014 Aerospike.All rights reserved. Confidential 4 North American RTB speeds & feeds ■  1 to 6 billion cookies tracked ■  Some companies track 200M, some track 20B ■  Each bidder has their own data pool ■  Data is your weapon ■  Recent searches, behavior, IP addresses ■  Audience clusters (K-cluster, K-means) from offline Hadoop ■  “Remnant” from Google, Yahoo is about 0.6 million / sec ■  Facebook exchange: about 0.6 million / sec ■  “other” is 0.5 million / sec Currently about 3.0M / sec in North American
  • 5.
    © 2014 Aerospike.All rights reserved. Confidential 5 Advertising requirements ■  100 millisecond or 150 millisecond ad delivery ■  De-facto standard set in 2004 by Washington Post and others ■  North America is 70 to 90 milliseconds wide ■  Two or three data centers ■  Auction is limited to 30 milliseconds ■  Typically closes in 5 milliseconds ■  Winners have more data, better models – in 5 milliseconds
  • 6.
    © 2014 Aerospike.All rights reserved. Confidential 6 MILLIONS OF CONSUMERS BILLIONS OF DEVICES APP SERVERS DATA WAREHOUSEINSIGHTS Advertising Technology Stack WRITE CONTEXT OPERATIONAL DB WRITE REAL-TIME CONTEXT READ RECENT CONTENT PROFILE STORE Cookies, email, deviceID, IP address, location, segments, clicks, likes, tweets, search terms... REAL-TIME ANALYTICS Best sellers, top scores, trending tweets BATCH ANALYTICS Discover patterns, segment data: location patterns, audience affinity
  • 7.
    © 2014 Aerospike.All rights reserved. Confidential 7 Financial Services – Intraday Positions LEGACY DATABASE (MAINFRAME) Read/Write Start of Day Data Loading End of Day Reconciliation Query REAL-TIME DATA FEED ACCOUNT POSITIONS XDR 10M+ user records Primary key access 1M+ TPS planned Finance App Records App RT Reporting App
  • 8.
    © 2014 Aerospike.All rights reserved. Confidential 8 Social Media MYSQL or POSTGRES (ROTATIONAL DISK) Recent user generated content Java application tier Data abstraction and sharding MODIFIED REDIS (SSD ENABLED) Content and Historical data
  • 9.
    © 2014 Aerospike.All rights reserved. Confidential 9 Travel Portal PRICING DATABASE (RATE LIMITED) Poll for Pricing Changes PRICING DATA Store Latest Price SESSION MANAGEMENT Session Data Read Price XDR Airlines forced interstate banking Legacy mainframe technology Multi-company reservation and pricing Requirement: 1M TPS allowing overhead Travel App
  • 10.
    © 2014 Aerospike.All rights reserved. Confidential 10 SOURCE DEVICE/ USER QOS & Real-Time Billing for Telcos ■  In-switch Per HTTP request Billing ■ US Telcos: 200M subscribers, 50 metros ■  In-memory use case Hot Standby Execute Request Real-time Checks DESTINATION Update Device User Settings Request XDR Real-time Auth. QoS Billing Config Module App
  • 11.
    © 2014 Aerospike.All rights reserved. Confidential 11 MILLIONS OF CONSUMERS BILLIONS OF DEVICES APP SERVERS BATCH ANALYTICSINSIGHTS BATCH ANALYTICS The New Architecture WRITE CONTEXT TRANSACTIONS & HOT ANALYTICS
  • 12.
    © 2014 Aerospike.All rights reserved. Confidential 12 Old Architecture ( scale out in 2000 ) Request routing and sharding APP SERVERS CACHE DATABASE STORAGE CONTENT DELIVERY NETWORK LOAD BALANCER
  • 13.
    © 2014 Aerospike.All rights reserved. Confidential 13 Modern Scale Out Architecture Load balancer Simple stateless APP SERVERS IN-MEMORY NoSQL RESEARCH WAREHOUSE CONTENT DELIVERY NETWORK LOAD BALANCER Long term cold storageFast stateless
  • 14.
    © 2014 Aerospike.All rights reserved. Confidential 14 Modern Scale Out Architecture Load balancer Simple stateless APP SERVERS IN-MEMORY NoSQL RESEARCH WAREHOUSE CONTENT DELIVERY NETWORK LOAD BALANCER Long term cold storageFast stateless HDFS BASED
  • 15.
    © 2014 Aerospike.All rights reserved. Confidential 15 Build a data layer Use open source Focus on Key Value Use In-memory NoSQL Use Flash
  • 16.
    © 2014 Aerospike.All rights reserved. Confidential 16 How Fast You Can Go And How To Do It
  • 17.
    © 2014 Aerospike.All rights reserved. Confidential 17 SHARED-NOTHING SYSTEM:100% DATA AVAILABILITY ■  Every node in a cluster is identical, handles both transactions and long running tasks ■  Data is replicated synchronously with immediate consistency within the cluster ■  Data is replicated asynchronously across data centers OHIO Data Center
  • 18.
    © 2014 Aerospike.All rights reserved. Confidential 18 LESSONS 1.  Optimize key-value code paths ■  No hot spots (e.g., robust DHT) ■  Scales up easily (e.g., easy to size) ■  Avoids points of failure (e.g., single node type) ■  Binary protocol 2.  Code in C ■  Read() / Write() / Linux AIO ( don’t trust a library ) ■  Multithread ! ■  Direct device ■  C++ could work, leads to structural complexity 3.  Memory allocation matters ■  Stack based allocators ■  Own stack allocator ■  JEMalloc for pools
  • 19.
    © 2014 Aerospike.All rights reserved. Confidential 19 LESSONS (cont’d) 4.  Innovation: masters in a shared nothing system ■  Fast cluster organization ■  Fast transaction capabilities ■  Can be CP or AP - and resolve data accurately 5.  Clients are hard ■  Fast stable connection pools are hard ■  API design matters ■  Slow languages need Aerospike more 6.  Aggregations / queries required ■  Row oriented ■  Secondary indexes as filters, and MR style ■  Data transformation hurts
  • 20.
    © 2014 Aerospike.All rights reserved. Confidential 20 LESSONS (cont’d) 7.  Network interrupts are painful ■  TCP is still better than UDP ■  Some great hardware out there (solarflare) ■  Network queues, RRD ■  New PCI-e, other interfaces are not there yet ■  Larger clusters solve interface issues 8.  “Large data types” (better documents) ■  Ordered List, Map, Set, Stack ■  Time series and documents ■  Beyond per-row storage layout, more optimal than document 9.  UDFs for flexibility
  • 21.
    © 2014 Aerospike.All rights reserved. Confidential 21 WRITING RELIABILY WITH HIGH PERFORMANCE 1.  Write sent to row master 2.  Latch against simultaneous writes 3.  Apply write to master memory and replica memory synchronously 4.  Queue operations to disk 5.  Signal completed transaction (optional storage commit wait) 6.  Master applies conflict resolution policy (rollback/ rollforward) master replica 1.  Cluster discovers new node via gossip protocol 2.  Paxos vote determines new data organization 3.  Partition migrations scheduled 4.  When a partition migration starts, write journal starts on destination 5.  Partition moves atomically 6.  Journal is applied and source data deleted transactions continue Writing with Immediate Consistency Adding a Node
  • 22.
    © 2014 Aerospike.All rights reserved. Confidential 22 Key Value Store + Lists, Maps ■  Namespaces (policy containers) ■  Determine storage - DRAM or Flash ■  Determine replication factor ■  Contain records and sets ■  Sets (tables) of records ■  Arbitrary grouping ■  Records (rows) of key/bins ■  Block size (128k – 2MB) ■  Bin with same name can contain values of different types ■  String, integer, bytes (raw, blob, etc) ■  list ( an ordered collection of values ) ■  map ( a collection of keys and values ) ■  Bins can be added anytime ■  Meta data ■  Generation counter so apps can ensure that a record was not modified since last read ■  Time-to-live value for auto expiration, keeping most recent context or "hot" data, aging out historical context
  • 23.
    © 2014 Aerospike.All rights reserved. Confidential 23 KVS + Lists, Maps + Queries + UDFs STREAM AGGREGATIONS (INDEXED MAP-REDUCE) Pipe Query results through UDFs ■  Filter, Transform, Aggregate.. Map, Reduce ■  Enforce security ■  UDFs in Lua to ■ CRUD a record ■ Calculation based on data within a record ■ Iterate through a set / namespace of records ■  UDFs for real-time analytics and aggregations
  • 24.
    © 2014 Aerospike.All rights reserved. Confidential 24 SQL & NoSQL ➤ Secondary index §  Equality, Range, IN (,,,), Compound §  e.g. WHERE group_id = 1234, WHERE last_activity > 1349293398, WHERE branch_id IN (5,6,7,8) ➤ Filters §  SQL: Where clause with non-indexed “AND”s (e.g. “AND gender=‘M’ ”) §  NOSQL: Map step ➤ Aggregation §  SQL: GROUP BY, ORDER BY, LIMIT, OFFSET §  NOSQL: Reduce step Secondary Key Primary Key Record Filter Map Aggregate DRAM SSD Aggregate Client Client Server Reduce Aggregate Query
  • 25.
    © 2014 Aerospike.All rights reserved. Confidential 25 Flash storage The Power of Flash Storage
  • 26.
    © 2014 Aerospike.All rights reserved. Confidential 26 DATABASE OS FILE SYSTEM PAGE CACHE BLOCK INTERFACE SSD HDD BLOCK INTERFACE SSD SSD OPEN NVM SSD Ask me and I’ll tell you the answer.Ask me. I’ll look up the answer and then tell it to you. DATABASE HYBRID MEMORY SYSTEM™ •  Direct device access •  Large Block Writes •  Indexes in DRAM •  Highly Parallelized •  Log-structured FS “copy-on-write” •  Fast restart with shared memory FLASH OPTIMIZED HIGH PERFORMANCE
  • 27.
    © 2014 Aerospike.All rights reserved. Confidential 27© 2012 Aerospike. All rights reserved. Pg. 27 Measure your drives! Aerospike Certification Tool (ACT) http://github.com/aerospike/act Transactional database workload Reads: 1.5KB (can’t batch / cache reads, random) Writes: 128K blocks (log based layout) (plus defragmentation) Turn up the load until latency is over required SLA
  • 28.
    © 2014 Aerospike.All rights reserved. Confidential 28 Aerospike’s Flash Experience ■  Know your Flash ■ ACT benchmark http://github.com/aerospike/act ■ Read-write benchmark results back to 2011 ■  All clouds support flash now ■ New EC2 instances ■ Google Compute ■ Internap, Softlayer, GoGrid… ■  Write durability usually not a problem with modern flash ■ Durability is high (5 “drive writes per day” for 5 years, etc) ■ Read performance suffers under write load anyway
  • 29.
    © 2014 Aerospike.All rights reserved. Confidential 29 Aerospike’s Flash Experience ■  Densities increasing ■ 100G 2 years ago à 800G today ■ SATA vs PCI-E ■ Appliances: 50T per 1U this year ■  Prices still dropping: perhaps $1/G next year ■  Intel P3700 results ■ 250K per device @ $2.5 / G ■ Old standard: Micron P320h 500K @ $8 / G ■  “Wide SATA” ■ 20 SATA drives ■ LSI “pass through mode” ■ 250K+ per server
  • 30.
    © 2014 Aerospike.All rights reserved. Confidential 30 Use Open Source
  • 31.
    © 2014 Aerospike.All rights reserved. Confidential 31 Aerospike: the trusted In-Memory NoSQL Performance • Over ten trillion transactions per month • 99% of transactions < 2 ms • 150K TPS per server Scalability • Billions of Internet users • Clustered Software • Maintenance without downtime • Scale up & scale out Reliability • 50 customers; zero down-time • Immediate Consistency • Rapid Failover; Data Center Replication Price/Performance • Makes impossible projects affordable • Flash-optimized • 1/10 the servers required
  • 32.
    © 2014 Aerospike.All rights reserved. Confidential 32