Netcetera

Cassandra for Barcodes, Products and Scans:
The Backend Infrastructure at Scandit
@scandit
www.scandit.com
February 1, 2012
Christof Roduner
Co-founder and COO
christof@scandit.com

2
AGENDA
 About Scandit
 Requirements
 Apache Cassandra
 Scandit backend

3
WHAT IS SCANDIT?
Scandit provides developers with best-in-class tools to build, analyze and monetize product-centric apps.
IDENTIFY: Products
ANALYZE: User Interest
MONETIZE: Apps

4
IDENTIFY: BARCODE SCANNER
 Scandit SDK
 Fastest and most reliable barcode scanning technology for camera phones
 Available for all major platforms:
 iOS
 Android
 Symbian / Qt
 PhoneGap
 Features:
 Scans from any angle
 Does not need autofocus
 Works with low-end cameras (→ Android, iPad2)
 Supports all barcode types (1D, 2D)

5
DEMO VIDEO
www.scandit.com/video

6
ANALYZE:
THE SCANALYTICS PLATFORM
 Tool for app publishers
 App-specific usage statistics
 Insights into consumer behavior:
 What do users scan?
 Product categories? Groceries, electronics, books, cosmetics, …?
 Where do users scan?
 At home? Or while in a retail store?
 Top products and brands
 Identify new opportunities:
 Customer engagement
 Product interest
 Cross-selling and up-selling

9
BACKEND REQUIREMENTS
 Product database
 Many millions of products
 Many different data sources
 Curation of product data (filtering, etc.)
 Analysis of scans
 Accept and store high volumes of scans
 Generate statistics over extended time periods
 Correlate with product data
 Provide reports to developers

10
BACKEND DESIGN GOALS
 Scalability
 High-volume storage
 High-volume throughput
 Support large number of concurrent client requests (app)
 Availability
 Low maintenance

11
WHICH DATABASE?
Apache Cassandra
 Large, distributed key-value store (DHT)
 «NoSQL»
 Inspired by:
 Amazon’s Dynamo distributed storage system
 Google’s BigTable data model
 Originally developed at Facebook
 Inbox search

12
WHY DID WE CHOOSE IT?
 Looked very fast
 Even when data is much larger than RAM
 Performs well in write-heavy environment
 Proven scalability
 Without downtime
 Tunable replication
 Easy to run and maintain
 No sharding
 All nodes are the same - no dedicated coordinators, masters, slaves, …
 Data model
 YMMV…

13
WHAT YOU HAVE TO GIVE UP
 Joins
 Referential integrity
 Transactions
 Expressive query language
 Consistency (tunable, but…)
 Limited support for:
 Schema
 Secondary indices

14
CASSANDRA DATA MODEL
 Column families
 Rows
 Columns
 (Supercolumns)
 We’ll skip them - Cassandra developers don’t like them
Disclaimer: I tend to say «hash» when I mean «dictionary, map, associative array» (Can you tell my favorite language?)

15
COLUMNS AND ROWS
 Column:
 Is a name-value pair
 Row:
 Has exactly one key
 Contains any number of columns
 Columns are always automatically sorted by their name
 Column family:
 A collection of any number of rows (!)
 Has a name
 «Like a table»

16
EXAMPLE COLUMN FAMILY
 A column family «users» containing two rows
 Columns can be different in every row
 First row has a column named «phone», second row does not
 Rows can have many columns
 You can add millions of them
"users": {
  "christof": {
    "email": "christof@scandit.com",
    "phone": "123-456-7890"
  },
  "moritz": {
    "email": "moritz@scandit.com",
    "web": "www.example.com"
  }
}
Callouts: «christof» is a row key; the row «moritz» has two columns, automatically sorted by their names («email», «web»)
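The structure on this slide can be mimicked with nested dictionaries. This is only a toy sketch in Python, not Cassandra client code; the helper name `columns` is made up for illustration, and the sorting is done on read here, whereas Cassandra keeps columns sorted in storage.

```python
# Toy model of the «users» column family: row key -> {column name -> value}.
# Plain Python dictionaries, not a Cassandra client.
users = {
    "christof": {"phone": "123-456-7890", "email": "christof@scandit.com"},
    "moritz": {"web": "www.example.com", "email": "moritz@scandit.com"},
}


def columns(column_family, row_key):
    """Return the (name, value) pairs of one row, sorted by column name."""
    row = column_family.get(row_key, {})
    return sorted(row.items())


christof_columns = columns(users, "christof")
moritz_columns = columns(users, "moritz")
print(christof_columns)
print(moritz_columns)
```

Each row carries its own set of columns, so «phone» exists only in the first row and «web» only in the second.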

17
DATA IN COLUMN NAMES
 Column names can be used to store data
 Frequent pattern in Cassandra
 Takes advantage of column sorting
"logins": {
  "christof": {
    "2012-01-29 16:22:30 +0100": "208.115.113.86",
    "2012-01-30 07:48:03 +0100": "66.249.66.183",
    "2012-01-30 18:06:55 +0100": "208.115.111.70",
    "2012-01-31 12:37:26 +0100": "66.249.66.183"
  },
  "moritz": {
    "2012-01-23 01:12:49 +0100": "205.209.190.116"
  }
}
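Because column names are kept sorted, a range of columns can be read with two binary searches. The sketch below illustrates the idea in plain Python; `column_slice` is a hypothetical helper, not a Cassandra API, and comparing the timestamps as strings only works here because all names share one format and one UTC offset.

```python
from bisect import bisect_left, bisect_right

# One row of the «logins» column family: column name (timestamp) -> value (IP).
logins_christof = {
    "2012-01-29 16:22:30 +0100": "208.115.113.86",
    "2012-01-30 07:48:03 +0100": "66.249.66.183",
    "2012-01-30 18:06:55 +0100": "208.115.111.70",
    "2012-01-31 12:37:26 +0100": "66.249.66.183",
}


def column_slice(row, first, last):
    """Return all (name, value) pairs with first <= name <= last."""
    names = sorted(row)  # Cassandra stores columns already sorted
    lo = bisect_left(names, first)
    hi = bisect_right(names, last)
    return [(name, row[name]) for name in names[lo:hi]]


recent = column_slice(
    logins_christof,
    "2012-01-30 00:00:00 +0100",
    "2012-01-31 23:59:59 +0100",
)
for name, ip in recent:
    print(name, ip)
```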

18
SCHEMA AND DATA TYPES
 Schema is optional
 Data type can be defined for:
 Keys
 The values of all columns with a given name
 The column names in a CF
 By default, data type BLOB is used
 Data Types
 BLOB (default)
 ASCII text
 UTF8 text
 Timestamp
 Boolean
 UUID
 Integer (arbitrary length)
 Float
 Double
 Decimal

19
CLUSTER ORGANIZATION
Ring of four nodes:
 Node 1: token 0
 Node 2: token 64 (stores range 1-64)
 Node 3: token 128 (stores range 65-128)
 Node 4: token 192

20
STORING A ROW
1. Calculate md5 hash for row key
Example: md5("foobar") = 48
2. Determine data range for hash
Example: 48 lies within range 1-64
3. Store row on node responsible for range
Example: store on node 2
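The three steps can be sketched in a few lines of Python. The sketch assumes the toy ring from the slides (tokens 0, 64, 128 and 192 in a token space of 0 to 255) and reduces the md5 digest modulo 256 to fit that space; real md5 values are 128-bit numbers, and `toy_token` and `node_for_token` are illustrative names only.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 256  # assumption: toy token space 0..255, as in the slide's ring
TOKENS = [0, 64, 128, 192]
NODES = {0: "node 1", 64: "node 2", 128: "node 3", 192: "node 4"}


def toy_token(row_key):
    """Step 1: hash the row key with md5, reduced to the toy token space."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def node_for_token(token):
    """Steps 2 and 3: find the range containing the token and its node.

    A node with token T is responsible for the range that ends at T, so we
    look for the first node token >= the row's token and wrap around to
    node 1 when the token lies beyond the last node.
    """
    index = bisect_left(TOKENS, token)
    if index == len(TOKENS):
        index = 0
    return NODES[TOKENS[index]]


print(node_for_token(48))  # the slide's example value
print(node_for_token(toy_token("foobar")))
```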

21
IMPLICATIONS
 Cluster automatically balanced
 Load is shared equally between nodes
 No hotspots
 Scaling out?
 Easy
 Divide data ranges by adding more nodes
 Cluster rebalances itself automatically
 Range queries over rows not possible
 You can’t retrieve «all rows from A-C»
 Rows are not stored in their «natural» order
 Rows are stored in order of their md5 hashes

22
IF YOU NEED RANGE QUERIES…
Option 1: «Order Preserving Partitioner» (OPP)
 OPP determines node based on a row’s key instead of its hash
 Don’t use it…
 Manually balancing a cluster is hard
 Hotspots
 Balancing cluster for one column family creates hotspot for another
Option 2: Use columns instead of rows
 Columns are always sorted
 Rows can store millions of columns

23
REPLICATION
 Tunable replication factor (RF)
 RF > 1: rows are automatically replicated to next RF-1 nodes
 Tunable replication strategy
 «Ensure two replicas in different data centers, racks, etc.»
Diagram: ring of four nodes (Node 1, token 0; Node 2, token 64; Node 3, token 128; Node 4, token 192) holding replica 1 and replica 2 of row «foobar»
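A minimal sketch of the «next RF-1 nodes» rule, again assuming the toy four-node ring with tokens 0, 64, 128 and 192. The function name `replica_nodes` is hypothetical, and the sketch ignores replication strategies that take data centers or racks into account.

```python
from bisect import bisect_left

TOKENS = [0, 64, 128, 192]  # assumption: the toy ring from the slides
NODE_NAMES = ["node 1", "node 2", "node 3", "node 4"]


def replica_nodes(token, replication_factor):
    """Return the nodes holding a row: its primary node plus the next RF-1."""
    if not 1 <= replication_factor <= len(TOKENS):
        raise ValueError("replication factor must be between 1 and the node count")
    # Primary node: first node whose token is >= the row's token (with wrap-around).
    first = bisect_left(TOKENS, token) % len(TOKENS)
    # Remaining replicas: walk clockwise around the ring.
    return [
        NODE_NAMES[(first + step) % len(TOKENS)]
        for step in range(replication_factor)
    ]


print(replica_nodes(48, 2))
print(replica_nodes(200, 2))
```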

24
CLIENT ACCESS
 Clients can send read and write requests to any node
 This node will act as coordinator
 Coordinator forwards request to nodes where data resides
Diagram: a client sends the request below to one node of the four-node ring; that node forwards it to the nodes holding replica 1 and replica 2 of row «foobar»
Request:
insert(
"foobar": { "email": "fb@example.com" }
)

25
CONSISTENCY LEVELS
 For all requests, clients can set a consistency level (CL)
 For writes:
 CL defines how many replicas must be written before «success» is returned to client
 For reads:
 CL defines how many replicas must respond before result is returned to client
 Consistency levels:
 ONE
 QUORUM
 ALL
 … (data center-aware levels)
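How many replicas each level waits for can be written down directly. The sketch below covers only ONE, QUORUM and ALL; the function names are illustrative and not part of any client library. A quorum is a majority of the replicas, and a read is guaranteed to contact at least one replica that acknowledged the latest write whenever the read and write counts add up to more than RF.

```python
def required_replicas(consistency_level, replication_factor):
    """Number of replicas that must answer for a given consistency level."""
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        return replication_factor // 2 + 1  # majority of the replicas
    if consistency_level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {consistency_level}")


def reads_overlap_writes(read_cl, write_cl, replication_factor):
    """True if every read contacts at least one replica of the latest write."""
    reads = required_replicas(read_cl, replication_factor)
    writes = required_replicas(write_cl, replication_factor)
    return reads + writes > replication_factor


for level in ("ONE", "QUORUM", "ALL"):
    print(level, required_replicas(level, 3))
print(reads_overlap_writes("QUORUM", "QUORUM", 3))
```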

26
INCONSISTENT DATA
 Example scenario:
 Replication factor 2
 Two existing replicas for row «foobar»
 Client overwrites existing columns in «foobar»
 Replica 2 is down
 What happens:
 Column is updated in replica 1, but not replica 2 (even with CL=ALL: the request is reported as failed, but the write to replica 1 is not rolled back)
 Timestamps to the rescue
 Every column has a timestamp
 Timestamps are supplied by clients
 Upon read, column with latest timestamp wins
 → Use NTP to keep client clocks in sync
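The «latest timestamp wins» rule amounts to taking a maximum over the replica responses. A minimal sketch, assuming each replica returns a (value, timestamp) pair for the requested column; the values and timestamps below are made up for illustration, and ties between equal timestamps are not handled.

```python
def reconcile(replica_responses):
    """Pick the column version with the latest client-supplied timestamp."""
    return max(replica_responses, key=lambda column: column[1])


# Hypothetical responses for one column of row «foobar»:
# replica 1 received the overwrite, replica 2 was down and still holds the old value.
responses = [
    ("new@example.com", 1328090400),  # replica 1, newer timestamp
    ("fb@example.com", 1328004000),   # replica 2, stale
]

winner = reconcile(responses)
print(winner)
```

Since clients supply the timestamps, a client with a clock that runs ahead would win every conflict, which is why synchronized clocks matter.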

27
PREVENTING INCONSISTENCIES
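 Read repair: replicas found to be stale during a read are updated with the latest value
 Hinted handoff: a write for a replica that is down is kept as a hint and delivered once the replica is back
 Anti-entropy: replicas compare their data and synchronize differences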

28
RETRIEVING DATA (API)
 At a row level, you can…
 Get all rows
 Get a single row by specifying its key
 Get a number of rows by specifying their keys
 Get a range of rows
 Only with OPP, strongly discouraged
 At a column level, you can…
 Get all columns
 Get a single column by specifying its name
 Get a number of columns by specifying their names
 Get a range of columns by specifying the name of the first and last column
 Again: no ranges of rows

29
CASSANDRA QUERY LANGUAGE (CQL)
UPDATE users SET
  "email" = "christof@scandit.com",
  "phone" = "123-456-7890"
WHERE KEY = "christof";
"users": {
  "christof": {
    "phone": "123-456-7890"
  },
  "moritz": {
  }
}

30
(CQL)
SELECT * FROM users WHERE KEY = "christof";
"users": {
  "christof": {
    "phone": "123-456-7890"
  },
  "moritz": {
  }
}

31
(CQL)
SELECT "2012-01-30 00:00:00 +0100" ..
"2012-01-31 23:59:59 +0100"
FROM logins
WHERE KEY = "christof";
"logins": {
  "christof": {
    "2012-01-29 16:22:30 +0100": "208.115.113.86",
    "2012-01-30 07:48:03 +0100": "66.249.66.183",
    "2012-01-30 18:06:55 +0100": "208.115.111.70",
    "2012-01-31 12:37:26 +0100": "66.249.66.183"
  },
  "moritz": {
    "2012-01-23 01:12:49 +0100": "205.209.190.116"
  }
}

32
SECONDARY INDICES
 Secondary indices can be defined for (single) columns
 Secondary indices only support equality predicate (=) in queries
 Each node maintains index for data it owns
 When indexed column is queried, request must be forwarded to all nodes
 Sometimes better to manually maintain your own index
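One common way to maintain your own index is a second column family whose row keys are the indexed values and whose column names are the keys of the matching rows. The sketch below models this with plain dictionaries; `users_by_email`, `set_email` and `find_by_email` are hypothetical names, and in a real cluster the two writes would not be atomic, since Cassandra offers no transactions.

```python
# Data column family: row key -> {column name -> value}.
users = {}
# Hand-maintained index column family: email -> {user row key -> ""}.
users_by_email = {}


def set_email(user_key, email):
    """Write the «email» column and keep the index in step with it."""
    old_email = users.get(user_key, {}).get("email")
    if old_email is not None:
        # Remove the stale index entry before adding the new one.
        users_by_email.get(old_email, {}).pop(user_key, None)
    users.setdefault(user_key, {})["email"] = email
    users_by_email.setdefault(email, {})[user_key] = ""


def find_by_email(email):
    """Look up user keys with a single row read on the index column family."""
    return sorted(users_by_email.get(email, {}))


set_email("christof", "christof@scandit.com")
set_email("moritz", "moritz@scandit.com")
print(find_by_email("moritz@scandit.com"))
```

A lookup then touches only the node owning the index row instead of being forwarded to all nodes.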

33
PRODUCTION EXPERIENCE
 No stability issues
 Very fast
 Language bindings don’t have the same quality
 Out of sync, bugs
 Data model is a mental twist
 Design-time decisions sometimes hard to change
 Rudimentary access control

34
TRYING OUT CASSANDRA
 DataStax website
 Company founded by Cassandra developers
 Provides
 Documentation
 Amazon Machine Image
 Apache website
 Mailing lists

35
CLUSTER AT SCANDIT
 Several nodes in two data centers
 Linux machines
 Identical setup on every node
 Allows for easy failover

36
NODE ARCHITECTURE
Diagram labels: Website & REST API (Ruby on Rails, Rack); Phusion Passenger (mod_passenger); arrows «from mobile apps and web browsers» and «to other nodes»
