DBS302:
Driving a Realtime
Personalization Engine
with Cloud Bigtable
Calvin French-Owen, Co-Founder & CTO, Segment
You’re making a hard
choice...
Our roadmap
- A bit of background
- Personas architecture
- BigQuery + Cloud Bigtable
- Making hard choices
A bit of background
- 19,000 users
- 300B monthly events
- 450B outbound API calls
- TB of data per day
Segment by the numbers
Under the hood...
[Diagram: API → Kafka → Consumer → DB, fanning out to api.google.com, api.salesforce.com, api.intercom.io, and api.mixpanel.com]
The biggest advantage of
this system
It’s stateless
In 2018... we started
getting a new set of
requirements
Personas brought some
decidedly stateful use
cases
The use cases of Personas
1) Profile API: query profiles in real time
2) Identity resolution: match users by identity
3) Audience computation: create audiences of users
Personas
Personas architecture
Let’s first talk about
lambda architectures...
- Data is sent to both the batch and speed layers
- The batch layer runs bigger computations
- The speed layer serves real-time updates (+ diffs)
Lambda architecture
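The defining move of a lambda architecture is writing every event twice. A minimal sketch, with hypothetical handler names (not Segment's actual code):

```go
package main

import "fmt"

// Event is a simplified incoming message (illustrative fields only).
type Event struct {
	UserID string
	Type   string
}

// batchLayer would append the event to long-term storage for large,
// periodic recomputations (BigQuery in this talk).
func batchLayer(e Event) { fmt.Println("batch:", e.UserID, e.Type) }

// speedLayer would apply the event to a low-latency store so reads
// reflect it immediately (Cloud Bigtable in this talk).
func speedLayer(e Event) { fmt.Println("speed:", e.UserID, e.Type) }

func main() {
	e := Event{UserID: "u-123", Type: "page_view"}
	// Every event fans out to both layers.
	batchLayer(e)
	speedLayer(e)
}
```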
- Query profiles in real time (speed)
- Match users by identity (speed)
- Create audiences of users (batch)
Personas
Different pipelines,
different datastores
[Diagram: Kafka → Pub/Sub → workers feeding BigQuery (batch) and Cloud Bigtable (speed)]
Kafka -> Pub/Sub
Segment messages
- Tracking things like pageviews, user events, etc.
- Semi-structured JSON
- Typically ~1 KB
- Hundreds of thousands of ~1 KB messages
- Published from Kafka to Cloud Pub/Sub
- Writes data twice: once for real time, once for batch
- Audience computation in BigQuery
- Real-time reads in Cloud Bigtable
Personas architecture
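For concreteness, a hypothetical ~1 KB track message, loosely shaped like a Segment "track" call; the specific values are made up:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// A hypothetical tracking event; field values are illustrative only.
const raw = `{
  "type": "track",
  "userId": "u-123",
  "event": "Order Completed",
  "properties": {"revenue": 49.99, "currency": "USD"},
  "timestamp": "2019-04-09T12:00:00Z"
}`

func main() {
	// Semi-structured JSON: decode into a generic map rather than a fixed schema.
	var msg map[string]interface{}
	if err := json.Unmarshal([]byte(raw), &msg); err != nil {
		panic(err)
	}
	fmt.Println(msg["event"], msg["userId"])
}
```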
BigQuery +
Cloud Bigtable
- Use case
- Architecture
- Data model
- Query patterns
BigQuery + Cloud Bigtable
BigQuery:
Use case
[Diagram: Kafka → Pub/Sub → workers feeding BigQuery and Cloud Bigtable; a compute service queries BigQuery]
- Want to find users who meet arbitrary criteria
- Terabytes of data within a few minutes
- Tables have billions of rows
- We rarely care about all of the columns
- Real-time reads are not needed here
- Tens of concurrent queries
BigQuery: Use case
BigQuery:
Architecture
2004: MapReduce
2010: Dremel (built in 2006)
BigQuery: architecture
- Designed to interactively query datasets (seconds-minutes)
- Nested, structured data
- Uses SQL, no programming
- Private version: Dremel
BigQuery Architecture:
four good ideas
BigQuery idea #1:
Column-oriented
Suppose we want to build a
database...
A row-oriented database
What if my database has
billions of rows...
...and I only need location?
Store columns, not rows!
What if we invert the rows?
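A toy illustration of the inversion, with hypothetical field names: the same records laid out row-wise and column-wise, so a query that only needs location touches only that column.

```go
package main

import "fmt"

// Row-oriented: each record is stored (and read) as a whole.
type userRow struct {
	ID       string
	Name     string
	Location string
}

func main() {
	rows := []userRow{
		{"u1", "Ada", "SF"},
		{"u2", "Grace", "NYC"},
		{"u3", "Alan", "SF"},
	}

	// Column-oriented: the same data "inverted" into one slice per column.
	locations := make([]string, 0, len(rows))
	for _, r := range rows {
		locations = append(locations, r.Location)
	}

	// Scanning billions of rows for one column now reads a fraction of the bytes.
	for _, loc := range locations {
		fmt.Println(loc)
	}
}
```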
BigQuery idea #2:
Compression
Columns on disk
- We have a lot of repeated data
- Run-length-encoding (RLE)
- Let’s compress it...
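A minimal run-length encoder over a string column, to show why columns with lots of repeated values compress so well:

```go
package main

import "fmt"

// run is one (value, length) pair in a run-length-encoded column.
type run struct {
	Value string
	Count int
}

// rle compresses consecutive repeats, the common case for
// low-cardinality columns such as location.
func rle(col []string) []run {
	var out []run
	for _, v := range col {
		if n := len(out); n > 0 && out[n-1].Value == v {
			out[n-1].Count++
			continue
		}
		out = append(out, run{Value: v, Count: 1})
	}
	return out
}

func main() {
	col := []string{"SF", "SF", "SF", "NYC", "NYC", "SF"}
	fmt.Println(rle(col)) // [{SF 3} {NYC 2} {SF 1}]
}
```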
BigQuery idea #3:
Efficient nested decoding
What happens when I
select *?
[Diagram: a finite-state machine (FSM) walks the column readers to reassemble full nested records]
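A simplified sketch of the idea (not Dremel's actual repetition/definition-level encoding): a repeated nested field is stored as a flat column plus a repetition level per value, and a small loop, standing in for the FSM, stitches records back together when you select it.

```go
package main

import "fmt"

// A repeated, nested field (e.g. user.emails) stored columnar-style:
// flat values plus a repetition level per value. Level 0 means "starts a
// new record"; level 1 means "another value inside the same record".
var (
	values = []string{"a@x.com", "b@x.com", "c@y.com"}
	rlevel = []int{0, 1, 0}
)

func main() {
	// Reassemble records by walking the column once: a tiny stand-in for
	// the state machine BigQuery builds when you select nested fields.
	var records [][]string
	for i, v := range values {
		if rlevel[i] == 0 {
			records = append(records, nil)
		}
		last := len(records) - 1
		records[last] = append(records[last], v)
	}
	fmt.Println(records) // [[a@x.com b@x.com] [c@y.com]]
}
```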
BigQuery idea #4:
More servers, more
efficiency
[Diagram: a serving tree. A query enters at the root server, is fanned out through Level 1 intermediate servers to many leaf servers, and the partial results are merged at each level on the way back up.]
More servers ==
more distributed work
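A toy scatter-gather over goroutines, flattening the serving tree to a single level: the root fans the query out to every leaf shard in parallel, then merges the partial aggregates.

```go
package main

import (
	"fmt"
	"sync"
)

// leafQuery stands in for a leaf server scanning its shard of the data
// and returning a partial aggregate (here: a simple count).
func leafQuery(shard []int) int {
	count := 0
	for _, v := range shard {
		if v > 10 { // the "WHERE" clause
			count++
		}
	}
	return count
}

func main() {
	shards := [][]int{{3, 12, 19}, {7, 44}, {15, 2, 90, 1}}

	// Root fans the query out to every leaf in parallel...
	partials := make([]int, len(shards))
	var wg sync.WaitGroup
	for i, s := range shards {
		wg.Add(1)
		go func(i int, s []int) {
			defer wg.Done()
			partials[i] = leafQuery(s)
		}(i, s)
	}
	wg.Wait()

	// ...then merges the partial results on the way back up the tree.
	total := 0
	for _, p := range partials {
		total += p
	}
	fmt.Println("matching rows:", total) // 5
}
```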
BigQuery’s good ideas
1. Column-oriented
2. Compression
3. Fast encoding of nested data
4. Distribute the work (separate data + compute)
BigQuery:
Data model
We want to take user-
supplied criteria…
…and turn it into query
parameters
[Diagram: UI → JSON → SQL]
BigQuery: Data Model
- Dataset per customer
- Table per {collection, event}
- Additional tables for traits, identity, merges
- Repeated fields for external_ids
- Explode arbitrary nested properties
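The deck doesn't show the actual schema, so the sketch below assumes hypothetical dataset, table, and column names. It illustrates the naming scheme (dataset per customer, table per event) and the shape of a query over a repeated external_ids field using BigQuery's UNNEST.

```go
package main

import "fmt"

// tableFor sketches the naming scheme only: dataset per customer,
// table per event/collection. Names are hypothetical.
func tableFor(customerID, event string) string {
	return fmt.Sprintf("personas_%s.%s", customerID, event)
}

const audienceSQL = `
-- Repeated fields (external_ids) are queried by flattening them with UNNEST.
SELECT user_id
FROM ` + "`personas_acme.order_completed`" + `,
     UNNEST(external_ids) AS ext
WHERE ext.type = 'email'
  AND properties.revenue > 100
`

func main() {
	fmt.Println(tableFor("acme", "order_completed"))
	fmt.Println(audienceSQL)
}
```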
BigQuery:
Query patterns
[Diagram: Kafka → Pub/Sub → workers feeding BigQuery and Cloud Bigtable; the compute service runs queries against BigQuery]
Compute service runs queries every minute
Scan gigabytes in seconds
2 GB/s scanned (~170 TB/day), 800 slots
- Tens of concurrent queries
- Scans terabytes of data independently
- Partitioned by customer
- Query by arrays of external_ids
- AST stored as JSON and converted to SQL (see the sketch below)
Batch computations in BigQuery
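The stored AST format isn't shown in the deck, so this is a deliberately tiny, hypothetical version of the conversion step: a JSON tree of comparisons and AND nodes rendered into a SQL WHERE clause.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// node is a made-up AST shape: either a comparison or an AND of children.
type node struct {
	Op       string  `json:"op"`       // "and", "gt", "eq"
	Field    string  `json:"field"`    // for comparisons
	Value    float64 `json:"value"`    // for comparisons
	Children []node  `json:"children"` // for "and"
}

func toSQL(n node) string {
	switch n.Op {
	case "and":
		parts := make([]string, len(n.Children))
		for i, c := range n.Children {
			parts[i] = toSQL(c)
		}
		return "(" + strings.Join(parts, " AND ") + ")"
	case "gt":
		return fmt.Sprintf("%s > %v", n.Field, n.Value)
	case "eq":
		return fmt.Sprintf("%s = %v", n.Field, n.Value)
	}
	return "TRUE"
}

func main() {
	raw := `{"op":"and","children":[
	  {"op":"gt","field":"properties.revenue","value":100},
	  {"op":"eq","field":"properties.plan_tier","value":2}]}`
	var ast node
	if err := json.Unmarshal([]byte(raw), &ast); err != nil {
		panic(err)
	}
	fmt.Println("SELECT user_id FROM events WHERE " + toSQL(ast))
}
```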
Cloud Bigtable: Use case
[Diagram: Kafka → Pub/Sub → workers feeding BigQuery and Cloud Bigtable; Cloud Bigtable backs the Profile API]
Cloud Bigtable: Use case
- Small amounts of data (KB to MB)
- Able to be indexed for a single user
- A high read and write rate (tens of thousands of QPS)
- Data should be reflected in real-time
(Not a new idea)
Cloud Bigtable:
Architecture
Bigtable (published in 2006)
Cloud Bigtable: Architecture
[Diagram: the write path. A client sends write: <k, v> to a BT node; the node appends the entry to the commit log / tablet files on GFS (append(k, v)) and to its in-memory memtable (memtable.append(k, v)).]
Writes are fast appends
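A toy version of that write path, simplified to the two appends (the real node also flushes memtables into immutable tablet files and compacts them):

```go
package main

import "fmt"

// A toy Bigtable-style node: writes append to a commit log (on GFS in the
// paper) and to an in-memory memtable. No flushing or compaction shown.
type node struct {
	log      []string          // append-only commit log
	memtable map[string]string // sorted in the real system; a map is enough here
}

func (n *node) write(k, v string) {
	n.log = append(n.log, k+"="+v) // durable append first
	n.memtable[k] = v              // then the in-memory update
}

func main() {
	n := &node{memtable: map[string]string{}}
	n.write("user#42", `{"plan":"pro"}`)
	fmt.Println(n.memtable["user#42"], len(n.log))
}
```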
Cloud Bigtable: Architecture
[Diagram: the read path. A client issues read(k); the BT node first checks its in-memory memtable (memtable[k]) and, on a miss, reads the immutable tablet files on GFS, merging the results before returning the value.]
Reads check the cache first, then merge
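And a matching toy read path: memtable first, then older immutable files, newest value wins. (Real Bigtable also consults bloom filters so it can skip files that cannot contain the key.)

```go
package main

import "fmt"

// The memtable is the freshest layer; older values live in immutable
// on-disk files (SSTables), listed here newest first.
type store struct {
	memtable map[string]string
	sstables []map[string]string
}

func (s *store) read(k string) (string, bool) {
	if v, ok := s.memtable[k]; ok { // 1. in-memory layer first
		return v, true
	}
	for _, sst := range s.sstables { // 2. then merge across files, newest wins
		if v, ok := sst[k]; ok {
			return v, true
		}
	}
	return "", false
}

func main() {
	s := &store{
		memtable: map[string]string{"user#42": "v3"},
		sstables: []map[string]string{{"user#42": "v2", "user#7": "v1"}},
	}
	fmt.Println(s.read("user#42")) // v3 true (memtable shadows the older file)
	fmt.Println(s.read("user#7"))  // v1 true
}
```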
What about failures?
Cloud Bigtable: Architecture
[Diagram: the same read path when a BT node fails. Because tablets live on GFS rather than on the node itself, another BT node can pick up the tablets and keep serving read(k).]
Cloud Bigtable: Architecture
- Multi-tenant
- Row-oriented
- Log-structured merge tree
- Immutable, with in-memory caching
- Bloom filters save on reads
- Lock service maps nodes to keyspace
Cloud Bigtable:
Data model
- Separate tables for different datatypes
  - Records
  - Properties
  - Events
- Keys are ID- and time-ordered
- Values are Snappy-encoded
Cloud Bigtable: Data Model
- Records provide metadata to stitch together the full record
- User properties power the Profile API
- Events are sorted to query the last range of events
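The deck only says keys are ID- and time-ordered, so the key layout below is an assumption: a "<customer>#<user>#<reversed timestamp>" key makes the newest events sort first, and a prefix scan with the real cloud.google.com/go/bigtable client returns the last N events for a user. Project, instance, and table names are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"math"
	"time"

	"cloud.google.com/go/bigtable"
)

// eventKey builds a hypothetical row key. Subtracting the timestamp from
// MaxInt64 makes newer events sort first under Bigtable's lexicographic order.
func eventKey(customerID, userID string, t time.Time) string {
	reversed := math.MaxInt64 - t.UnixNano()
	return fmt.Sprintf("%s#%s#%020d", customerID, userID, reversed)
}

func main() {
	ctx := context.Background()
	fmt.Println(eventKey("acme", "u-123", time.Now()))

	client, err := bigtable.NewClient(ctx, "my-project", "personas-instance")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	tbl := client.Open("events")

	// Read the 10 most recent events for one user: the prefix scan walks the
	// time-ordered keys, and LimitRows stops after the newest ten.
	prefix := "acme#u-123#"
	err = tbl.ReadRows(ctx, bigtable.PrefixRange(prefix), func(r bigtable.Row) bool {
		fmt.Println(r.Key())
		return true // keep scanning
	}, bigtable.LimitRows(10))
	if err != nil {
		log.Fatal(err)
	}
}
```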
Cloud Bigtable + BigQuery:
In production
In production
- Cloud Bigtable
  - 55,000 rows written per second
  - 175,000 rows read per second
  - 10 TB of data
  - 16 nodes
- BigQuery
  - Hundreds of queries per minute
  - Scanning hundreds of GB per minute
  - 500 TB of data stored
Back to that hard choice...
BigQuery is hard to compare with anything else
A few places
Cloud Bigtable shines
1. Identification of hot keys
2. Write-heavy workloads
Split compute
- Compute is separated from storage
- Writes can be spread across many nodes
In summary...
Segment Personas
- Powered by Cloud Bigtable and BigQuery
- Cloud Bigtable for small, random reads
- BigQuery for batch aggregations
- Processes billions of events
- Large, multi-tenant architecture
- SQL for flexible feature development
- Favorable read/write costs
- Millions of dollars in revenue
- Scales to Google-levels
Fin


Editor's Notes

  • #3 Situation: We all know that today you have three choices in cloud providers. (List, eye contact) AWS, the market leader; Microsoft, thought to be the best enterprise solution; and Google, who has built a reputation for having the best technology. You're making a hard choice: you have to pick one cloud provider to build your infrastructure upon.
    Complication: And you know that when you make that choice, you'll be locked in, both by the switching costs and, likely, by a contract for a number of years.
    Implication: (List, discovery) Making that choice is complicated. Your decision will have lasting ramifications on cost, performance, and potentially even the ability to deliver a finished product.
    Position: (Storytelling) At Segment, we found ourselves in the same dilemma, having to make this same choice of cloud provider about a year and a half ago. All of our infrastructure ran on AWS, but as we were introducing a new product, Personas, we started hitting the limits of what its data stores could offer. That's when we started exploring the Google Cloud options. When we decided to build infrastructure on Google Cloud to process massive amounts of data, Bigtable and BigQuery provided the key breakthroughs that enabled us to deliver the product successfully.
    Action: As I map out Segment's Personas, I invite you to follow along and discover how we are able to process billions of events in real time.
    Benefit: Hopefully, our story will inspire you to consider building your own new products atop Google Cloud's data infrastructure.
  • #79 Can add new queries/UI dynamically. Totally robust. No new backend infra.