Netcetera
1. Cassandra for Barcodes, Products and Scans:
The Backend Infrastructure at Scandit
@scandit
www.scandit.com February 1, 2012
Christof Roduner
Co-founder and COO
christof@scandit.com
3. 3
WHAT IS SCANDIT?
Scandit provides developers best-in-class tools to build, analyze and monetize product-centric apps.
ANALYZE
User Interest
MONETIZE
Apps
IDENTIFY
Products
4. 4
IDENTIFY: BARCODE SCANNER
Scandit SDK
Fastest and most reliable barcode scanning technology for camera phones
Available for all major platforms:
iOS
Android
Symbian / Qt
Phonegap
Features:
Scans from any angle
Does not need autofocus
Works with low-end cameras (→ Android, iPad2)
Supports all barcode types (1D, 2D)
6. 6
ANALYZE: THE SCANALYTICS PLATFORM
Tool for app publishers
App-specific usage statistics
Insights into consumer behavior:
What do users scan?
Product categories? Groceries, electronics, books, cosmetics, …?
Where do users scan?
At home? Or while in a retail store?
Top products and brands
Identify new opportunities:
Customer engagement
Product interest
Cross-selling and up-selling
9. 9
BACKEND REQUIREMENTS
Product database
Many millions of products
Many different data sources
Curation of product data (filtering, etc.)
Analysis of scans
Accept and store high volumes of scans
Generate statistics over extended time periods
Correlate with product data
Provide reports to developers
10. 10
BACKEND DESIGN GOALS
Scalability
High-volume storage
High-volume throughput
Support large number of concurrent client requests (app)
Availability
Low maintenance
11. 11
WHICH DATABASE?
Apache Cassandra
Large, distributed key-value store (DHT)
«NoSQL»
Inspired by:
Amazon’s Dynamo distributed storage system
Google’s BigTable data model
Originally developed at Facebook
Inbox search
12. 12
WHY DID WE CHOOSE IT?
Looked very fast
Even when data is much larger than RAM
Performs well in write-heavy environment
Proven scalability
Without downtime
Tunable replication
Easy to run and maintain
No sharding
All nodes are the same - no coordinators, masters, slaves, …
Data model
YMMV…
13. 13
WHAT YOU HAVE TO GIVE UP
Joins
Referential integrity
Transactions
Expressive query language
Consistency (tunable, but…)
Limited support for:
Schema
Secondary indices
14. 14
CASSANDRA DATA MODEL
Column families
Rows
Columns
(Supercolumns)
We’ll skip them - Cassandra developers don’t like them
Disclaimer: I tend to say «hash» when I mean «dictionary, map, associative array» (Can you tell my favorite language?)
15. 15
COLUMNS AND ROWS
Column:
Is a name-value pair
Row:
Has exactly one key
Contains any number of columns
Columns are always automatically sorted by their name
Column family:
A collection of any number of rows (!)
Has a name
«Like a table»
16. 16
EXAMPLE COLUMN FAMILY
A column family «users» containing two rows
Columns can be different in every row
First row has a column named «phone», second row does not
Rows can have many columns
You can add millions of them
"users": {
  "christof": {
    "email": "christof@scandit.com",
    "phone": "123-456-7890"
  },
  "moritz": {
    "email": "moritz@scandit.com",
    "web": "www.example.com"
  }
}
Row with key «christof»
Two columns, automatically sorted by their names («email», «web»)
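A minimal sketch of the model above, using plain Python dictionaries rather than a real Cassandra client. The helper name `columns_of` is hypothetical; the point is only that each row has its own set of columns and that columns come back sorted by name.

```python
# Toy in-memory model of a column family (an assumption for illustration,
# not a Cassandra client): row key -> {column name -> value}.
users = {
    "christof": {"phone": "123-456-7890", "email": "christof@scandit.com"},
    "moritz": {"web": "www.example.com", "email": "moritz@scandit.com"},
}


def columns_of(column_family, row_key):
    """Return the (name, value) pairs of one row, sorted by column name."""
    row = column_family[row_key]
    return sorted(row.items())


christof_columns = columns_of(users, "christof")
moritz_columns = columns_of(users, "moritz")
print(christof_columns)
print(moritz_columns)
```

Note that the two rows do not share the same columns: only «christof» has «phone», only «moritz» has «web».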
17. 17
DATA IN COLUMN NAMES
Column names can be used to store data
Frequent pattern in Cassandra
Takes advantage of column sorting
"logins": {
  "christof": {
    "2012-01-29 16:22:30 +0100": "208.115.113.86",
    "2012-01-30 07:48:03 +0100": "66.249.66.183",
    "2012-01-30 18:06:55 +0100": "208.115.111.70",
    "2012-01-31 12:37:26 +0100": "66.249.66.183"
  },
  "moritz": {
    "2012-01-23 01:12:49 +0100": "205.209.190.116"
  }
}
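Why this pattern is useful can be sketched with a sorted-name lookup. The code below is an in-memory illustration, not Cassandra's API; `column_slice` is a hypothetical helper. It assumes all timestamps use the same format and UTC offset, so that sorting the names as strings matches chronological order.

```python
from bisect import bisect_left, bisect_right

# Same data as on the slide: column names are login times, values are IPs.
logins = {
    "christof": {
        "2012-01-29 16:22:30 +0100": "208.115.113.86",
        "2012-01-30 07:48:03 +0100": "66.249.66.183",
        "2012-01-30 18:06:55 +0100": "208.115.111.70",
        "2012-01-31 12:37:26 +0100": "66.249.66.183",
    },
    "moritz": {
        "2012-01-23 01:12:49 +0100": "205.209.190.116",
    },
}


def column_slice(row, start, finish):
    """Return the columns whose names lie in [start, finish], in name order."""
    names = sorted(row)  # a real store keeps columns sorted already
    lo = bisect_left(names, start)
    hi = bisect_right(names, finish)
    return [(name, row[name]) for name in names[lo:hi]]


# All logins of «christof» on January 30, 2012.
jan_30 = column_slice(
    logins["christof"],
    "2012-01-30 00:00:00 +0100",
    "2012-01-30 23:59:59 +0100",
)
print(jan_30)
```

Because the columns are sorted by name, a time range turns into one contiguous slice of a single row.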
18. 18
SCHEMA AND DATA TYPES
Schema is optional
Data type can be defined for:
Keys
The values of all columns with a given name
The column names in a CF
By default, data type BLOB is used
Data Types
BLOB (default)
ASCII text
UTF8 text
Timestamp
Boolean
UUID
Integer (arbitrary length)
Float
Double
Decimal
19. 19
CLUSTER ORGANIZATION
Ring of four nodes: Node 1 (token 0), Node 2 (token 64), Node 3 (token 128), Node 4 (token 192)
Range 1-64, stored on node 2
Range 65-128, stored on node 3
20. 20
STORING A ROW
1. Calculate md5 hash for row key
Example: md5("foobar") = 48
2. Determine data range for hash
Example: 48 lies within range 1-64
3. Store row on node responsible for range
Example: store on node 2
Ring of four nodes: Node 1 (token 0), Node 2 (token 64), Node 3 (token 128), Node 4 (token 192)
Range 1-64, stored on node 2
Range 65-128, stored on node 3
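The three steps can be sketched as follows. This is a toy model that mirrors the slide: the ring has only 256 positions and the md5 digest is reduced modulo 256, both simplifications chosen for illustration (the real token space is far larger). `token_for_key` and `node_for_token` are hypothetical names.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 256  # toy token space, matching the slide's small numbers
RING = [(0, "node 1"), (64, "node 2"), (128, "node 3"), (192, "node 4")]


def token_for_key(row_key):
    """Step 1: hash the row key with md5 and map it into the toy token space."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def node_for_token(token):
    """Steps 2 and 3: a node owns the range ending at its own token.

    The owner is the first node whose token is >= the hash; hashes above
    the highest token wrap around to the first node on the ring.
    """
    tokens = [t for t, _ in RING]
    index = bisect_left(tokens, token)
    return RING[index % len(RING)][1]


print(node_for_token(48))                        # the slide's example hash
print(node_for_token(token_for_key("foobar")))   # same lookup from a real key
```

The slide's value of 48 for «foobar» is itself a simplified example, so the sketch treats the hash value and the key lookup separately.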
21. 21
IMPLICATIONS
Cluster automatically balanced
Load is shared equally between nodes
No hotspots
Scaling out?
Easy
Divide data ranges by adding more nodes
Cluster rebalances itself automatically
Range queries not possible
You can’t retrieve «all rows from A-C»
Rows are not stored in their «natural» order
Rows are stored in order of their md5 hashes
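A short sketch of why row ranges are lost: sorting keys by their md5 hash has nothing to do with their alphabetical order. The row keys below are made up for illustration, and `md5_token` is a hypothetical helper.

```python
import hashlib


def md5_token(row_key):
    """Interpret the md5 digest of a row key as one large integer."""
    return int(hashlib.md5(row_key.encode("utf-8")).hexdigest(), 16)


# Hypothetical row keys, listed in their «natural» order.
row_keys = ["A-product", "B-product", "C-product", "D-product"]

# Order in which a hash-partitioned store would lay them out.
storage_order = sorted(row_keys, key=md5_token)

print("natural order:", row_keys)
print("storage order:", storage_order)
```

Neighbouring keys such as «A-product» and «B-product» can end up anywhere on the ring, which is why «all rows from A-C» is not a contiguous range.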
22. 22
IF YOU NEED RANGE QUERIES…
Option 1: «Order Preserving Partitioner» (OPP)
OPP determines node based on a row’s key instead of its hash
Don’t use it…
Manually balancing a cluster is hard
Hotspots
Balancing cluster for one column family creates hotspot for another
Option 2: Use columns instead of rows
Columns are always sorted
Rows can store millions of columns
23. 23
REPLICATION
Tunable replication factor (RF)
RF > 1: rows are automatically replicated to next RF-1 nodes
Tunable replication strategy
«Ensure two replicas in different data centers, racks, etc.»
Ring of four nodes: Node 1 (token 0), Node 2 (token 64), Node 3 (token 128), Node 4 (token 192)
Replica 1 of row «foobar»
Replica 2 of row «foobar»
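The «next RF-1 nodes» rule can be sketched on the same four-node ring. This models only the simple placement described above; rack- or data-center-aware strategies pick replicas differently. `replicas_for` is a hypothetical helper.

```python
# Nodes in token order around the ring (toy model, as on the slides).
RING = ["node 1", "node 2", "node 3", "node 4"]


def replicas_for(primary, replication_factor):
    """Return the primary node plus the next RF-1 nodes clockwise on the ring."""
    if not 1 <= replication_factor <= len(RING):
        raise ValueError("replication factor must be between 1 and the ring size")
    start = RING.index(primary)
    return [RING[(start + i) % len(RING)] for i in range(replication_factor)]


print(replicas_for("node 2", 2))
print(replicas_for("node 4", 3))  # wraps around the end of the ring
```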
24. 24
CLIENT ACCESS
Clients can send read and write requests to any node
This node will act as coordinator
Coordinator forwards request to nodes where data resides
Ring of four nodes: Node 1 (token 0), Node 2 (token 64), Node 3 (token 128), Node 4 (token 192)
Client request: insert("foobar": { "email": "fb@example.com" })
Replica 1 of row «foobar»
Replica 2 of row «foobar»
25. 25
CONSISTENCY LEVELS
For all requests, clients can set a consistency level (CL)
For writes:
CL defines how many replicas must be written before «success» is returned to client
For reads:
CL defines how many replicas must respond before result is returned to client
Consistency levels:
ONE
QUORUM
ALL
… (data center-aware levels)
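How the three basic levels translate into a number of replicas can be sketched as below. `required_replicas` is a hypothetical helper, and the data center-aware levels are left out. QUORUM is taken to mean a majority of the replicas, i.e. floor(RF / 2) + 1.

```python
def required_replicas(consistency_level, replication_factor):
    """Number of replicas that must acknowledge a request at the given CL."""
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        return replication_factor // 2 + 1  # strict majority of replicas
    if consistency_level == "ALL":
        return replication_factor
    raise ValueError("unsupported consistency level: " + consistency_level)


for level in ("ONE", "QUORUM", "ALL"):
    print(level, required_replicas(level, 3))
```

With RF = 2, QUORUM and ALL both require two replicas, so a single node being down blocks both levels.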
26. 26
INCONSISTENT DATA
Example scenario:
Replication factor 2
Two existing replicas for row «foobar»
Client overwrites existing columns in «foobar»
Replica 2 is down
What happens:
Column is updated in replica 1, but not replica 2 (even with CL=ALL!)
Timestamps to the rescue
Every column has a timestamp
Timestamps are supplied by clients
Upon read, column with latest timestamp wins
→ Use NTP
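The «latest timestamp wins» rule can be sketched as a small reconciliation step. This is an illustration, not Cassandra's read path: `reconcile` is a hypothetical helper, the timestamps are made-up integers, and ties between equal timestamps are not handled.

```python
def reconcile(versions):
    """Pick the winning column version from several replicas.

    `versions` is a list of (value, timestamp) pairs, one per replica that
    responded. The pair with the highest client-supplied timestamp wins.
    """
    return max(versions, key=lambda version: version[1])


# Replica 1 received the overwrite; replica 2 was down and still has old data.
replica_1 = ("new@example.com", 2000)
replica_2 = ("old@example.com", 1000)

winner = reconcile([replica_2, replica_1])
print(winner)
```

Since the timestamps come from clients, clocks that drift apart can make an older write look newer, hence the advice to run NTP.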
28. 28
RETRIEVING DATA (API)
At a row level, you can…
Get all rows
Get a single row by specifying its key
Get a number of rows by specifying their keys
Get a range of rows
Only with OPP, strongly discouraged
At a column level, you can…
Get all columns
Get a single column by specifying its name
Get a number of columns by specifying their names
Get a range of columns by specifying the name of the first and last column
Again: no ranges of rows
32. 32
SECONDARY INDICES
Secondary indices can be defined for (single) columns
Secondary indices only support equality predicate (=) in queries
Each node maintains index for data it owns
When indexed column is queried, request must be forwarded to all nodes
Sometimes better to manually maintain your own index
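One way to maintain your own index is a second column family whose row keys are the indexed values and whose column names are the keys of the matching rows. The sketch below uses in-memory dictionaries; the «city» column, the `users_by_city` column family and both helpers are hypothetical. It does not handle updates or deletes of an already indexed value.

```python
# Data column family: row key -> columns.
users = {}

# Index column family: row key = city, column names = user row keys.
# Column values are unused, so they are left empty.
users_by_city = {}


def insert_user(row_key, columns):
    """Write the user row and the matching index entry."""
    users[row_key] = columns
    city = columns.get("city")
    if city is not None:
        users_by_city.setdefault(city, {})[row_key] = ""


def find_by_city(city):
    """Look up one index row; its sorted column names are the user keys."""
    return sorted(users_by_city.get(city, {}))


insert_user("christof", {"email": "christof@scandit.com", "city": "Zurich"})
insert_user("moritz", {"email": "moritz@scandit.com", "city": "Zurich"})
print(find_by_city("Zurich"))
```

The lookup reads a single row by key, so it goes to the nodes owning that row instead of being forwarded to all nodes.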
33. 33
PRODUCTION EXPERIENCE
No stability issues
Very fast
Language bindings don’t have the same quality
Out of sync, bugs
Data model is a mental twist
Design-time decisions sometimes hard to change
Rudimentary access control
34. 34
TRYING OUT CASSANDRA
DataStax website
Company founded by Cassandra developers
Provides
Documentation
Amazon Machine Image
Apache website
Mailing lists
35. 35
CLUSTER AT SCANDIT
Several nodes in two data centers
Linux machines
Identical setup on every node
Allows for easy failover
36. 36
NODE ARCHITECTURE
Website & REST API
Ruby on Rails, Rack
to other nodes
from mobile apps and web browsers
Phusion Passenger
mod_passenger
ETH Zurich spin-off company
Founded by three former PhD students from ETH Zurich and MIT
Mission: Provide mobile app developers with tools to build…
At the center of our business: Barcode scanning algorithm developed at ETH Zurich
SDK
How is it different from Zxing, Zbar, etc.?
All platforms
Low-end Android phones
iPad2
Faster (before autofocus triggers)
Dynamic range (handles close codes well)