DBS302:
Driving a Realtime
Personalization Engine
with Cloud Bigtable
Calvin French-Owen, Co-Founder & CTO, Segment
You’re making a hard
choice...
Our roadmap
- A bit of background
- Personas architecture
- BigQuery + Cloud Bigtable
- Making hard choices
A bit of background
- 19,000 users
- 300B monthly events
- 450B outbound API calls
- TB of data per day
Segment by the numbers
Under the hood...
[Diagram: API → Kafka → Consumer → DB, fanning out to api.google.com, api.salesforce.com, api.intercom.io, and api.mixpanel.com]
The biggest advantage of
this system
It’s stateless
In 2018... we started
getting a new set of
requirements
Personas brought some
decidedly stateful use
cases
The use cases of Personas
1) Profile API: query profiles in real time
2) Identity resolution: match users by identity
3) Audience computation: create audiences of users
Personas
Personas architecture
Let’s first talk about
lambda architectures...
- Data is sent to both the batch and speed layers
- The batch layer runs bigger computations
- The speed layer serves real-time updates (+ diffs)
Lambda architecture
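The defining move of a lambda architecture is writing every event twice. A minimal sketch, with hypothetical handler names (not Segment's actual code):

```go
package main

import "fmt"

// Event is a simplified incoming message (illustrative fields only).
type Event struct {
	UserID string
	Type   string
}

// batchLayer would append the event to long-term storage for large,
// periodic recomputations (BigQuery in this talk).
func batchLayer(e Event) { fmt.Println("batch:", e.UserID, e.Type) }

// speedLayer would apply the event to a low-latency store so reads
// reflect it immediately (Cloud Bigtable in this talk).
func speedLayer(e Event) { fmt.Println("speed:", e.UserID, e.Type) }

func main() {
	e := Event{UserID: "u-123", Type: "page_view"}
	// Every event fans out to both layers.
	batchLayer(e)
	speedLayer(e)
}
```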
- Query profiles in real time (speed)
- Match users by identity (speed)
- Create audiences of users (batch)
Personas
Different pipelines,
different datastores
[Diagram: Kafka → Pub/Sub → workers feeding BigQuery (batch) and Cloud Bigtable (speed)]
Kafka -> Pub/Sub
Segment messages
- Tracking things like pageviews, user events, etc.
- Semi-structured JSON
- Typically ~1 KB
- Hundreds of thousands of ~1 KB messages
- Published from Kafka to Cloud Pub/Sub
- Writes data twice: once for real time, once for batch
- Audience computation in BigQuery
- Real-time reads in Cloud Bigtable
Personas architecture
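For concreteness, a hypothetical ~1 KB track message, loosely shaped like a Segment "track" call; the specific values are made up:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// A hypothetical tracking event; field values are illustrative only.
const raw = `{
  "type": "track",
  "userId": "u-123",
  "event": "Order Completed",
  "properties": {"revenue": 49.99, "currency": "USD"},
  "timestamp": "2019-04-09T12:00:00Z"
}`

func main() {
	// Semi-structured JSON: decode into a generic map rather than a fixed schema.
	var msg map[string]interface{}
	if err := json.Unmarshal([]byte(raw), &msg); err != nil {
		panic(err)
	}
	fmt.Println(msg["event"], msg["userId"])
}
```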
BigQuery +
Cloud Bigtable
- Use case
- Architecture
- Data model
- Query patterns
BigQuery + Cloud Bigtable
BigQuery:
Use case
[Diagram: Kafka → Pub/Sub → workers feeding BigQuery and Cloud Bigtable; a compute service queries BigQuery]
- Want to find users who meet arbitrary criteria
- Terabytes of data within a few minutes
- Tables have billions of rows
- We rarely care about all of the columns
- Real-time reads are not needed here
- Tens of concurrent queries
BigQuery: Use case
BigQuery:
Architecture
2004: MapReduce
2010: Dremel (built in 2006)
BigQuery: architecture
- Designed to interactively query datasets (seconds-minutes)
- Nested, structured data
- Uses SQL, no programming
- Private version: Dremel
BigQuery Architecture:
four good ideas
BigQuery idea #1:
Column-oriented
Suppose we want to build a
database...
A row-oriented database
What if my database has
billions of rows...
...and I only need location?
Store columns, not rows!
What if we invert the rows?
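A toy illustration of the inversion, with hypothetical field names: the same records laid out row-wise and column-wise, so a query that only needs location touches only that column.

```go
package main

import "fmt"

// Row-oriented: each record is stored (and read) as a whole.
type userRow struct {
	ID       string
	Name     string
	Location string
}

func main() {
	rows := []userRow{
		{"u1", "Ada", "SF"},
		{"u2", "Grace", "NYC"},
		{"u3", "Alan", "SF"},
	}

	// Column-oriented: the same data "inverted" into one slice per column.
	locations := make([]string, 0, len(rows))
	for _, r := range rows {
		locations = append(locations, r.Location)
	}

	// Scanning billions of rows for one column now reads a fraction of the bytes.
	for _, loc := range locations {
		fmt.Println(loc)
	}
}
```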
BigQuery idea #2:
Compression
Columns on disk
- We have a lot of repeated data
- Run-length-encoding (RLE)
- Let’s compress it...
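A minimal run-length encoder over a string column, to show why columns with lots of repeated values compress so well:

```go
package main

import "fmt"

// run is one (value, length) pair in a run-length-encoded column.
type run struct {
	Value string
	Count int
}

// rle compresses consecutive repeats, the common case for
// low-cardinality columns such as location.
func rle(col []string) []run {
	var out []run
	for _, v := range col {
		if n := len(out); n > 0 && out[n-1].Value == v {
			out[n-1].Count++
			continue
		}
		out = append(out, run{Value: v, Count: 1})
	}
	return out
}

func main() {
	col := []string{"SF", "SF", "SF", "NYC", "NYC", "SF"}
	fmt.Println(rle(col)) // [{SF 3} {NYC 2} {SF 1}]
}
```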
BigQuery idea #3:
Efficient nested decoding
What happens when I
select *?
[Diagram: a finite-state machine (FSM) walks the column readers to reassemble full nested records]
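A simplified sketch of the idea (not Dremel's actual repetition/definition-level encoding): a repeated nested field is stored as a flat column plus a repetition level per value, and a small loop, standing in for the FSM, stitches records back together when you select it.

```go
package main

import "fmt"

// A repeated, nested field (e.g. user.emails) stored columnar-style:
// flat values plus a repetition level per value. Level 0 means "starts a
// new record"; level 1 means "another value inside the same record".
var (
	values = []string{"a@x.com", "b@x.com", "c@y.com"}
	rlevel = []int{0, 1, 0}
)

func main() {
	// Reassemble records by walking the column once: a tiny stand-in for
	// the state machine BigQuery builds when you select nested fields.
	var records [][]string
	for i, v := range values {
		if rlevel[i] == 0 {
			records = append(records, nil)
		}
		last := len(records) - 1
		records[last] = append(records[last], v)
	}
	fmt.Println(records) // [[a@x.com b@x.com] [c@y.com]]
}
```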
BigQuery idea #4:
More servers, more
efficiency
[Diagram: a serving tree. A query enters at the root server, is fanned out through Level 1 intermediate servers to many leaf servers, and the partial results are merged at each level on the way back up.]
More servers ==
more distributed work
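A toy scatter-gather over goroutines, flattening the serving tree to a single level: the root fans the query out to every leaf shard in parallel, then merges the partial aggregates.

```go
package main

import (
	"fmt"
	"sync"
)

// leafQuery stands in for a leaf server scanning its shard of the data
// and returning a partial aggregate (here: a simple count).
func leafQuery(shard []int) int {
	count := 0
	for _, v := range shard {
		if v > 10 { // the "WHERE" clause
			count++
		}
	}
	return count
}

func main() {
	shards := [][]int{{3, 12, 19}, {7, 44}, {15, 2, 90, 1}}

	// Root fans the query out to every leaf in parallel...
	partials := make([]int, len(shards))
	var wg sync.WaitGroup
	for i, s := range shards {
		wg.Add(1)
		go func(i int, s []int) {
			defer wg.Done()
			partials[i] = leafQuery(s)
		}(i, s)
	}
	wg.Wait()

	// ...then merges the partial results on the way back up the tree.
	total := 0
	for _, p := range partials {
		total += p
	}
	fmt.Println("matching rows:", total) // 5
}
```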
BigQuery’s good ideas
1. Column-oriented
2. Compression
3. Fast encoding of nested data
4. Distribute the work (separate data + compute)
BigQuery:
Data model
We want to take user-
supplied criteria…
…and turn it into query
parameters
[Diagram: UI → JSON → SQL]
BigQuery: Data Model
- Dataset per customer
- Table per {collection, event}
- Additional tables for traits, identity, merges
- Repeated fields for external_ids
- Explode arbitrary nested properties
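The deck doesn't show the actual schema, so the sketch below assumes hypothetical dataset, table, and column names. It illustrates the naming scheme (dataset per customer, table per event) and the shape of a query over a repeated external_ids field using BigQuery's UNNEST.

```go
package main

import "fmt"

// tableFor sketches the naming scheme only: dataset per customer,
// table per event/collection. Names are hypothetical.
func tableFor(customerID, event string) string {
	return fmt.Sprintf("personas_%s.%s", customerID, event)
}

const audienceSQL = `
-- Repeated fields (external_ids) are queried by flattening them with UNNEST.
SELECT user_id
FROM ` + "`personas_acme.order_completed`" + `,
     UNNEST(external_ids) AS ext
WHERE ext.type = 'email'
  AND properties.revenue > 100
`

func main() {
	fmt.Println(tableFor("acme", "order_completed"))
	fmt.Println(audienceSQL)
}
```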
BigQuery:
Query patterns
[Diagram: Kafka → Pub/Sub → workers feeding BigQuery and Cloud Bigtable; the compute service runs queries against BigQuery]
Compute service runs queries every minute
Scan gigabytes in seconds
2 GB/s scanned (~170 TB/day), 800 slots
- Tens of concurrent queries
- Scans terabytes of data independently
- Partitioned by customer
- Query by arrays of external_ids
- AST stored as JSON and converted to SQL (see the sketch below)
Batch computations in BigQuery
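The stored AST format isn't shown in the deck, so this is a deliberately tiny, hypothetical version of the conversion step: a JSON tree of comparisons and AND nodes rendered into a SQL WHERE clause.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// node is a made-up AST shape: either a comparison or an AND of children.
type node struct {
	Op       string  `json:"op"`       // "and", "gt", "eq"
	Field    string  `json:"field"`    // for comparisons
	Value    float64 `json:"value"`    // for comparisons
	Children []node  `json:"children"` // for "and"
}

func toSQL(n node) string {
	switch n.Op {
	case "and":
		parts := make([]string, len(n.Children))
		for i, c := range n.Children {
			parts[i] = toSQL(c)
		}
		return "(" + strings.Join(parts, " AND ") + ")"
	case "gt":
		return fmt.Sprintf("%s > %v", n.Field, n.Value)
	case "eq":
		return fmt.Sprintf("%s = %v", n.Field, n.Value)
	}
	return "TRUE"
}

func main() {
	raw := `{"op":"and","children":[
	  {"op":"gt","field":"properties.revenue","value":100},
	  {"op":"eq","field":"properties.plan_tier","value":2}]}`
	var ast node
	if err := json.Unmarshal([]byte(raw), &ast); err != nil {
		panic(err)
	}
	fmt.Println("SELECT user_id FROM events WHERE " + toSQL(ast))
}
```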
Cloud Bigtable: Use case
[Diagram: Kafka → Pub/Sub → workers feeding BigQuery and Cloud Bigtable; Cloud Bigtable backs the Profile API]
Cloud Bigtable: Use case
- Small amounts of data (KB to MB)
- Able to be indexed for a single user
- A high read and write rate (tens of thousands of QPS)
- Data should be reflected in real-time
(Not a new idea)
Cloud Bigtable:
Architecture
Bigtable (published in 2006)
Cloud Bigtable: Architecture
[Diagram: the write path. A client sends write: <k, v> to a BT node; the node appends the entry to the commit log / tablet files on GFS (append(k, v)) and to its in-memory memtable (memtable.append(k, v)).]
Writes are fast appends
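A toy version of that write path, simplified to the two appends (the real node also flushes memtables into immutable tablet files and compacts them):

```go
package main

import "fmt"

// A toy Bigtable-style node: writes append to a commit log (on GFS in the
// paper) and to an in-memory memtable. No flushing or compaction shown.
type node struct {
	log      []string          // append-only commit log
	memtable map[string]string // sorted in the real system; a map is enough here
}

func (n *node) write(k, v string) {
	n.log = append(n.log, k+"="+v) // durable append first
	n.memtable[k] = v              // then the in-memory update
}

func main() {
	n := &node{memtable: map[string]string{}}
	n.write("user#42", `{"plan":"pro"}`)
	fmt.Println(n.memtable["user#42"], len(n.log))
}
```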
Cloud Bigtable: Architecture
[Diagram: the read path. A client issues read(k); the BT node first checks its in-memory memtable (memtable[k]) and, on a miss, reads the immutable tablet files on GFS, merging the results before returning the value.]
Reads check the cache first, then merge
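And a matching toy read path: memtable first, then older immutable files, newest value wins. (Real Bigtable also consults bloom filters so it can skip files that cannot contain the key.)

```go
package main

import "fmt"

// The memtable is the freshest layer; older values live in immutable
// on-disk files (SSTables), listed here newest first.
type store struct {
	memtable map[string]string
	sstables []map[string]string
}

func (s *store) read(k string) (string, bool) {
	if v, ok := s.memtable[k]; ok { // 1. in-memory layer first
		return v, true
	}
	for _, sst := range s.sstables { // 2. then merge across files, newest wins
		if v, ok := sst[k]; ok {
			return v, true
		}
	}
	return "", false
}

func main() {
	s := &store{
		memtable: map[string]string{"user#42": "v3"},
		sstables: []map[string]string{{"user#42": "v2", "user#7": "v1"}},
	}
	fmt.Println(s.read("user#42")) // v3 true (memtable shadows the older file)
	fmt.Println(s.read("user#7"))  // v1 true
}
```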
What about failures?
Cloud Bigtable: Architecture
[Diagram: the same read path when a BT node fails. Because tablets live on GFS rather than on the node itself, another BT node can pick up the tablets and keep serving read(k).]
Cloud Bigtable: Architecture
- Multi-tenant
- Row-oriented
- Log-structured merge tree
- Immutable, with in-memory caching
- Bloom filters save on reads
- Lock service maps nodes to keyspace
Cloud Bigtable:
Data model
- Separate tables for different datatypes
  - Records
  - Properties
  - Events
- Keys are ID- and time-ordered
- Values are Snappy-encoded
Cloud Bigtable: Data Model
- Records provide metadata to stitch together the full record
- User properties power the Profile API
- Events are sorted to query the last range of events
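The deck only says keys are ID- and time-ordered, so the key layout below is an assumption: a "<customer>#<user>#<reversed timestamp>" key makes the newest events sort first, and a prefix scan with the real cloud.google.com/go/bigtable client returns the last N events for a user. Project, instance, and table names are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"math"
	"time"

	"cloud.google.com/go/bigtable"
)

// eventKey builds a hypothetical row key. Subtracting the timestamp from
// MaxInt64 makes newer events sort first under Bigtable's lexicographic order.
func eventKey(customerID, userID string, t time.Time) string {
	reversed := math.MaxInt64 - t.UnixNano()
	return fmt.Sprintf("%s#%s#%020d", customerID, userID, reversed)
}

func main() {
	ctx := context.Background()
	fmt.Println(eventKey("acme", "u-123", time.Now()))

	client, err := bigtable.NewClient(ctx, "my-project", "personas-instance")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	tbl := client.Open("events")

	// Read the 10 most recent events for one user: the prefix scan walks the
	// time-ordered keys, and LimitRows stops after the newest ten.
	prefix := "acme#u-123#"
	err = tbl.ReadRows(ctx, bigtable.PrefixRange(prefix), func(r bigtable.Row) bool {
		fmt.Println(r.Key())
		return true // keep scanning
	}, bigtable.LimitRows(10))
	if err != nil {
		log.Fatal(err)
	}
}
```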
Cloud Bigtable + BigQuery:
In production
In production
- Cloud Bigtable
  - 55,000 rows written per second
  - 175,000 rows read per second
  - 10 TB of data
  - 16 nodes
- BigQuery
  - Hundreds of queries per minute
  - Scanning hundreds of GB per minute
  - 500 TB of data stored
Back to that hard choice...
BigQuery is hard to compare with anything else
A few places
Cloud Bigtable shines
1. Identification of hot keys
2. Write-heavy workloads
Split compute
- Compute is separated from storage
- Writes can be spread across many nodes
In summary...
Segment Personas
- Powered by Cloud Bigtable and BigQuery
- Cloud Bigtable for small, random reads
- BigQuery for batch aggregations
- Processes billions of events
- Large, multi-tenant architecture
- SQL for flexible feature development
- Favorable read/write costs
- Millions of dollars in revenue
- Scales to Google-levels
Fin


Editor's Notes

  • #3 Situation: We all know that today you have three choices in cloud providers. (List, eye contact) AWS, the market leader; Microsoft, thought to be the best enterprise solution; and Google, who has built a reputation for having the best technology. You're making a hard choice: you have to pick one cloud provider to build your infrastructure upon.
    Complication: And you know that when you make that choice, you'll be locked in, both by the switching costs and, likely, by a contract for a number of years.
    Implication: (List, discovery) Making that choice is complicated. Your decision will have lasting ramifications on cost, performance, and potentially even the ability to deliver a finished product.
    Position: (Storytelling) At Segment, we found ourselves in the same dilemma, having to make this same choice of cloud provider about a year and a half ago. All of our infrastructure ran on AWS, but as we were introducing a new product, Personas, we started hitting the limits of what its data stores could offer. That's when we started exploring the Google Cloud options. When we decided to build infrastructure on Google Cloud to process massive amounts of data, Bigtable and BigQuery provided the key breakthroughs that enabled us to deliver the product successfully.
    Action: As I map out Segment's Personas, I invite you to follow along and discover how we are able to process billions of events in real time.
    Benefit: Hopefully, our story will inspire you to consider building your own new products atop Google Cloud's data infrastructure.
  • #79 Can add new queries/UI dynamically. Totally robust. No new backend infra.