Cassandra for Barcodes, Products and Scans:
The Backend Infrastructure at Scandit
@scandit
www.scandit.com February 1, 2012
Christof Roduner
Co-founder and COO
christof@scandit.com
2
AGENDA
 About Scandit
 Requirements
 Apache Cassandra
 Scandit backend
3
WHAT IS SCANDIT?
Scandit provides developers with best-in-class tools to build, analyze and monetize product-centric apps.
ANALYZE: User Interest
MONETIZE: Apps
IDENTIFY: Products
4
IDENTIFY: BARCODE SCANNER
 Scandit SDK
 Fastest and most reliable barcode scanning technology for camera phones
 Available for all major platforms:
 iOS
 Android
 Symbian / Qt
 Phonegap
 Features:
 Scans from any angle
 Does not need autofocus
 Works with low-end cameras (→ Android, iPad2)
 Supports all barcode types (1D, 2D)
5
DEMO VIDEO
www.scandit.com/video
6
ANALYZE:
THE SCANALYTICS PLATFORM
 Tool for app publishers
 App-specific usage statistics
 Insights into consumer behavior:
 What do users scan?
 Product categories? Groceries, electronics, books, cosmetics, …?
 Where do users scan?
 At home? Or while in a retail store?
 Top products and brands
 Identify new opportunities:
 Customer engagement
 Product interest
 Cross-selling and up-selling
9
BACKEND REQUIREMENTS
 Product database
 Many millions of products
 Many different data sources
 Curation of product data (filtering, etc.)
 Analysis of scans
 Accept and store high volumes of scans
 Generate statistics over extended time periods
 Correlate with product data
 Provide reports to developers
10
BACKEND DESIGN GOALS
 Scalability
 High-volume storage
 High-volume throughput
 Support large number of concurrent client requests (app)
 Availability
 Low maintenance
11
WHICH DATABASE?
Apache Cassandra
 Large, distributed key-value store (DHT)
 «NoSQL»
 Inspired by:
 Amazon’s Dynamo distributed storage system
 Google’s BigTable data model
 Originally developed at Facebook
 Inbox search
12
WHY DID WE CHOOSE IT?
 Looked very fast
 Even when data is much larger than RAM
 Performs well in write-heavy environment
 Proven scalability
 Without downtime
 Tunable replication
 Easy to run and maintain
 No sharding
 All nodes are the same - no dedicated coordinators, masters, slaves, …
 Data model
 YMMV…
13
WHAT YOU HAVE TO GIVE UP
 Joins
 Referential integrity
 Transactions
 Expressive query language
 Consistency (tunable, but…)
 Limited support for:
 Schema
 Secondary indices
14
CASSANDRA DATA MODEL
 Column families
 Rows
 Columns
 (Supercolumns)
 We’ll skip them - Cassandra developers don’t like them
Disclaimer: I tend to say «hash» when I mean «dictionary, map, associative array» (Can you tell my favorite language?)
15
COLUMNS AND ROWS
 Column:
 Is a name-value pair
 Row:
 Has exactly one key
 Contains any number of columns
 Columns are always automatically sorted by their name
 Column family:
 A collection of any number of rows (!)
 Has a name
 «Like a table»
16
EXAMPLE COLUMN FAMILY
 A column family «users» containing two rows
 Columns can be different in every row
 First row has a column named «phone», second row does not
 Rows can have many columns
 You can add millions of them
"users": {
    "christof": {
        "email": "christof@scandit.com",
        "phone": "123-456-7890"
    },
    "moritz": {
        "email": "moritz@scandit.com",
        "web": "www.example.com"
    }
}
Row with key «christof»
Two columns, automatically sorted by their names («email», «web»)
17
DATA IN COLUMN NAMES
 Column names can be used to store data
 Frequent pattern in Cassandra
 Takes advantage of column sorting
"logins": {
    "christof": {
        "2012-01-29 16:22:30 +0100": "208.115.113.86",
        "2012-01-30 07:48:03 +0100": "66.249.66.183",
        "2012-01-30 18:06:55 +0100": "208.115.111.70",
        "2012-01-31 12:37:26 +0100": "66.249.66.183"
    },
    "moritz": {
        "2012-01-23 01:12:49 +0100": "205.209.190.116"
    }
}
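A minimal sketch of the "data in column names" pattern, modelling a column family as a plain Python dictionary of rows. The helper names `insert` and `get_row` are hypothetical and not part of any Cassandra client. The chronological ordering works here only because the timestamp strings share one format and one UTC offset, so sorting by name equals sorting by time.

```python
def insert(column_family: dict, row_key: str, columns: dict) -> None:
    """Add or overwrite columns in a row, creating the row if needed."""
    column_family.setdefault(row_key, {}).update(columns)


def get_row(column_family: dict, row_key: str) -> list:
    """Return the row's columns as (name, value) pairs, sorted by column name."""
    return sorted(column_family.get(row_key, {}).items())


logins: dict = {}
# Columns are written out of order ...
insert(logins, "christof", {"2012-01-30 07:48:03 +0100": "66.249.66.183"})
insert(logins, "christof", {"2012-01-29 16:22:30 +0100": "208.115.113.86"})
# ... but reading the row returns them sorted by column name.
for name, value in get_row(logins, "christof"):
    print(name, value)
```

Cassandra keeps columns sorted on disk; the sketch sorts on read only to keep the model short.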
18
SCHEMA AND DATA TYPES
 Schema is optional
 Data type can be defined for:
 Keys
 The values of all columns with a given name
 The column names in a CF
 By default, data type BLOB is used
 Data Types
 BLOB (default)
 ASCII text
 UTF8 text
 Timestamp
 Boolean
 UUID
 Integer (arbitrary length)
 Float
 Double
 Decimal
19
CLUSTER ORGANIZATION
Ring of four nodes:
Node 1: token 0
Node 2: token 64
Node 3: token 128
Node 4: token 192
Range 1-64, stored on node 2
Range 65-128, stored on node 3
20
STORING A ROW
1. Calculate md5 hash for row key
Example: md5("foobar") = 48 (simplified; real md5 values are 128-bit numbers)
2. Determine data range for hash
Example: 48 lies within range 1-64
3. Store row on node responsible for range
Example: store on node 2
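The three steps can be sketched in a few lines of Python. This is an illustration under assumptions, not Cassandra's implementation: the token space is shrunk to 0..255 to match the slide's toy ring, and the names `toy_token`, `owner_of_token` and `owner_of_row` are made up for this example.

```python
import bisect
import hashlib

# Toy ring from the slide: four nodes, token space 0..255.
RING = [(0, "node1"), (64, "node2"), (128, "node3"), (192, "node4")]
TOKENS = [token for token, _ in RING]


def toy_token(row_key: str) -> int:
    """Step 1: hash the row key with md5 and fold it into the toy token space."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % 256


def owner_of_token(token: int) -> str:
    """Steps 2 and 3: find the node whose range (previous token, own token] holds the token."""
    index = bisect.bisect_left(TOKENS, token)
    if index == len(TOKENS):  # above the highest token: wrap around to the first node
        index = 0
    return RING[index][1]


def owner_of_row(row_key: str) -> str:
    """Return the node responsible for storing the row with this key."""
    return owner_of_token(toy_token(row_key))
```

With the slide's example value, `owner_of_token(48)` falls into range 1-64 and resolves to node 2.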
21
IMPLICATIONS
 Cluster automatically balanced
 Load is shared equally between nodes
 No hotspots
 Scaling out?
 Easy
 Divide data ranges by adding more nodes
 Cluster rebalances itself automatically
 Range queries not possible
 You can’t retrieve «all rows from A-C»
 Rows are not stored in their «natural» order
 Rows are stored in order of their md5 hashes
22
IF YOU NEED RANGE QUERIES…
Option 1: «Order Preserving Partitioner» (OPP)
 OPP determines node based on a row’s key instead of its hash
 Don’t use it…
 Manually balancing a cluster is hard
 Hotspots
 Balancing cluster for one column family creates hotspot for another
Option 2: Use columns instead of rows
 Columns are always sorted
 Rows can store millions of columns
23
REPLICATION
 Tunable replication factor (RF)
 RF > 1: rows are automatically replicated to next RF-1 nodes
 Tunable replication strategy
 «Ensure two replicas in different data centers, racks, etc.»
[Diagram: ring of four nodes (tokens 0, 64, 128, 192) showing replica 1 and replica 2 of row «foobar»]
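A rough sketch of the simplest placement rule described above: the first replica goes to the node that owns the token, and the remaining RF-1 replicas go to the next nodes on the ring. The function name `replicas_for_token` and the 0..255 token space are assumptions for illustration. A data-center-aware or rack-aware strategy would pick nodes differently.

```python
import bisect

# Toy ring from the slides: four nodes, token space 0..255.
RING = [(0, "node1"), (64, "node2"), (128, "node3"), (192, "node4")]


def replicas_for_token(token: int, replication_factor: int, ring=RING) -> list:
    """Return the owner of `token` followed by the next RF-1 nodes on the ring."""
    tokens = [node_token for node_token, _ in ring]
    # Owner is the first node whose token is >= the row's token (wrapping around).
    first = bisect.bisect_left(tokens, token) % len(ring)
    # A cluster cannot hold more replicas of a row than it has nodes.
    count = min(replication_factor, len(ring))
    return [ring[(first + offset) % len(ring)][1] for offset in range(count)]
```

For a row with token 48 and RF = 2, this yields node 2 and node 3.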
24
CLIENT ACCESS
 Clients can send read and write requests to any node
 This node will act as coordinator
 Coordinator forwards request to nodes where data resides
[Diagram: a client sends a request to one node of the four-node ring; replica 1 and replica 2 of row «foobar» are marked]
Request:
insert(
    "foobar": { "email": "fb@example.com" }
)
25
CONSISTENCY LEVELS
 For all requests, clients can set a consistency level (CL)
 For writes:
 CL defines how many replicas must be written before «success» is returned to client
 For reads:
 CL defines how many replicas must respond before result is returned to client
 Consistency levels:
 ONE
 QUORUM
 ALL
 … (data center-aware levels)
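As a worked example, the three basic levels translate into replica counts as follows, with QUORUM meaning a majority of the replicas. The function names are hypothetical. The second helper encodes the usual rule of thumb that a read overlaps the latest successful write when read replicas plus write replicas exceed the replication factor.

```python
def required_replicas(consistency_level: str, replication_factor: int) -> int:
    """Number of replicas that must acknowledge a request at the given level."""
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        return replication_factor // 2 + 1  # strict majority
    if consistency_level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {consistency_level}")


def read_overlaps_write(write_cl: str, read_cl: str, replication_factor: int) -> bool:
    """True if every read set must share at least one replica with every write set."""
    written = required_replicas(write_cl, replication_factor)
    read = required_replicas(read_cl, replication_factor)
    return written + read > replication_factor
```

With RF = 3, writing and reading at QUORUM touches 2 + 2 replicas, so the sets overlap. Writing and reading at ONE touches 1 + 1, so a read may hit a replica that has not yet seen the write.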
26
INCONSISTENT DATA
 Example scenario:
 Replication factor 2
 Two existing replicas for row «foobar»
 Client overwrites existing columns in «foobar»
 Replica 2 is down
 What happens:
 Column is updated in replica 1, but not replica 2 (even with CL=ALL !)
 Timestamps to the rescue
 Every column has a timestamp
 Timestamps are supplied by clients
 Upon read, column with latest timestamp wins
 →Use NTP
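A minimal sketch of "latest timestamp wins", assuming each replica returns its own version of a column. The `Column` type and the `reconcile` function are illustrative names, and the integer timestamps stand in for client-supplied clock values.

```python
from typing import NamedTuple


class Column(NamedTuple):
    name: str
    value: str
    timestamp: int  # supplied by the writing client, hence the need for synchronized clocks


def reconcile(versions: list) -> Column:
    """Pick the version of a column with the most recent timestamp."""
    return max(versions, key=lambda column: column.timestamp)


# Replica 2 was down during the overwrite and still holds the old value.
replica_1 = Column("email", "new@example.com", timestamp=200)
replica_2 = Column("email", "old@example.com", timestamp=100)
winner = reconcile([replica_2, replica_1])
print(winner.value)
```

If a client's clock runs behind, its newer write can carry an older timestamp and lose against stale data, which is why the slide recommends NTP.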
27
PREVENTING INCONSISTENCIES
 Read repair
 Hinted handoff
 Anti-entropy
28
RETRIEVING DATA (API)
 At a row level, you can…
 Get all rows
 Get a single row by specifying its key
 Get a number of rows by specifying their keys
 Get a range of rows
 Only with OPP, strongly discouraged
 At a column level, you can…
 Get all columns
 Get a single column by specifying its name
 Get a number of columns by specifying their names
 Get a range of columns by specifying the name of the first and last column
 Again: no ranges of rows
29
CASSANDRA QUERY LANGUAGE
(CQL)
UPDATE users SET
    'email' = 'christof@scandit.com',
    'phone' = '123-456-7890'
WHERE KEY = 'christof';

"users": {
    "christof": {
        "email": "christof@scandit.com",
        "phone": "123-456-7890"
    },
    "moritz": {
        "email": "moritz@scandit.com",
        "web": "www.example.com"
    }
}
30
CASSANDRA QUERY LANGUAGE
(CQL)
SELECT * FROM users WHERE KEY = 'christof';

"users": {
    "christof": {
        "email": "christof@scandit.com",
        "phone": "123-456-7890"
    },
    "moritz": {
        "email": "moritz@scandit.com",
        "web": "www.example.com"
    }
}
31
CASSANDRA QUERY LANGUAGE
(CQL)
SELECT '2012-01-30 00:00:00 +0100' ..
       '2012-01-31 23:59:59 +0100'
FROM logins
WHERE KEY = 'christof';

"logins": {
    "christof": {
        "2012-01-29 16:22:30 +0100": "208.115.113.86",
        "2012-01-30 07:48:03 +0100": "66.249.66.183",
        "2012-01-30 18:06:55 +0100": "208.115.111.70",
        "2012-01-31 12:37:26 +0100": "66.249.66.183"
    },
    "moritz": {
        "2012-01-23 01:12:49 +0100": "205.209.190.116"
    }
}
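What the column range query returns can be sketched with a sorted list and binary search. This models the semantics only. `column_slice` is a hypothetical helper, not a client API, and both bounds are treated as inclusive.

```python
import bisect


def column_slice(row: dict, start: str, finish: str) -> list:
    """Return (name, value) pairs for all columns with start <= name <= finish."""
    names = sorted(row)  # columns are ordered by name
    low = bisect.bisect_left(names, start)
    high = bisect.bisect_right(names, finish)
    return [(name, row[name]) for name in names[low:high]]


christof_logins = {
    "2012-01-29 16:22:30 +0100": "208.115.113.86",
    "2012-01-30 07:48:03 +0100": "66.249.66.183",
    "2012-01-30 18:06:55 +0100": "208.115.111.70",
    "2012-01-31 12:37:26 +0100": "66.249.66.183",
}
recent = column_slice(
    christof_logins,
    "2012-01-30 00:00:00 +0100",
    "2012-01-31 23:59:59 +0100",
)
for name, value in recent:
    print(name, value)
```

The login from January 29 lies before the start bound, so the slice contains the remaining three columns.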
32
SECONDARY INDICES
 Secondary indices can be defined for (single) columns
 Secondary indices only support equality predicate (=) in queries
 Each node maintains index for data it owns
 When indexed column is queried, request must be forwarded to all nodes
 Sometimes better to manually maintain your own index
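One way to maintain your own index, sketched with plain dictionaries standing in for column families: a second column family whose row key is the indexed value and whose column names are the keys of the matching rows. The names `users_by_email`, `set_email` and `find_users_by_email` are hypothetical. A lookup then reads a single row instead of asking every node.

```python
users: dict = {}           # row key = user name, columns = attributes
users_by_email: dict = {}  # row key = email, column names = user row keys


def set_email(user_key: str, email: str) -> None:
    """Write the email column and keep the index column family in step."""
    old_email = users.get(user_key, {}).get("email")
    if old_email is not None:
        # Remove the stale index entry before adding the new one.
        users_by_email.get(old_email, {}).pop(user_key, None)
    users.setdefault(user_key, {})["email"] = email
    users_by_email.setdefault(email, {})[user_key] = ""  # value unused; the name carries the data


def find_users_by_email(email: str) -> list:
    """Return the keys of all user rows with this email, sorted by name."""
    return sorted(users_by_email.get(email, {}))
```

Without transactions, the two writes are not atomic, so the application has to tolerate or clean up index entries that briefly disagree with the data.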
33
PRODUCTION EXPERIENCE
 No stability issues
 Very fast
 Language bindings don’t have the same quality
 Out of sync, bugs
 Data model is a mental twist
 Design-time decisions sometimes hard to change
 Rudimentary access control
34
TRYING OUT CASSANDRA
 DataStax website
 Company founded by Cassandra developers
 Provides
 Documentation
 Amazon Machine Image
 Apache website
 Mailing lists
35
CLUSTER AT SCANDIT
 Several nodes in two data centers
 Linux machines
 Identical setup on every node
 Allows for easy failover
36
NODE ARCHITECTURE
Website & REST API
Ruby on Rails, Rack
to other nodes
from mobile apps and web browsers
Phusion Passenger
mod_passenger
THANK YOU!
www.scandit.com


Editor's Notes

  1. ETH Zurich spin-off company. Founded by three former PhD students from ETH Zurich and MIT. Mission: Provide mobile app developers with tools to build… At the center of our business: barcode scanning algorithm developed at ETH Zurich. SDK: How is it different from ZXing, ZBar, etc.? All platforms; low-end Android phones; iPad2; faster (before autofocus triggers); dynamic range (handles close codes well).