Netcetera

2,177 views

Published on

A brief presentation held by Christof from Scandit at Necetera in Zurich, Switzerland, about the Cassandra and how it's being used at Scandit.

Published in: Technology
0 Comments
5 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
2,177
On SlideShare
0
From Embeds
0
Number of Embeds
681
Actions
Shares
0
Downloads
14
Comments
0
Likes
5
Embeds 0
No embeds

No notes for slide
  • ETH Zurichspin-offcompanyFoundedbythreeformerPhDstudentsfrom ETH Zurichand MITMission: Provide mobile appdeveloperswithtoolstobuild…Atthecenterofourbusiness:Barcode scanningalgorithmdevelopedat ETH ZurichSDKHow is it different from Zxing, Zbar, etc.?All platformsLow-end AndroidphonesiPad2Faster (beforeautofocustriggers)Dynamic range (handlesclosecodeswell)
  • Netcetera

    1. 1. Cassandra for Barcodes, Products and Scans: The Backend Infrastructure at Scandit @scandit www.scandit.com February 1, 2012 Christof Roduner Co-founder and COO christof@scandit.com
    2. 2. 2 AGENDA  About Scandit  Requirements  Apache Cassandra  Scandit backend
    3. 3. 3 WHAT IS SCANDIT? Scandit provides developers best-in-class tools to build, analyze and monetize product-centric apps. ANALYZE User Interest MONETIZE Apps IDENTIFY Products
    4. 4. 4 IDENTIFY: BARCODE SCANNER  Scandit SDK  Fastest and most reliable barcode scanning technology for camera phones  Available for all major platforms:  iOS  Android  Symbian / Qt  Phonegap  Features:  Scans from any angle  Does not need autofocus  Works with low-end cameras (→ Android, iPad2)  Supports all barcode types (1D, 2D)
    5. 5. 5 DEMO VIDEO www.scandit.com/video
    6. 6. 6 ANALYZE: THE SCANALYTICS PLATFORM  Tool for app publishers  App-specific usage statistics  Insights into consumer behavior:  What do users scan?  Product categories? Groceries, electronics, books, cosmetics, …?  Where do users scan?  At home? Or while in a retail store?  Top products and brands  Identify new opportunities:  Customer engagement  Product interest  Cross-selling and up-selling
    7. 7. 7 ANALYZE: THE SCANALYTICS PLATFORM
    8. 8. 8 ANALYZE: THE SCANALYTICS PLATFORM
    9. 9. 9 BACKEND REQUIREMENTS  Product database  Many millions of products  Many different data sources  Curation of product data (filtering, etc.)  Analysis of scans  Accept and store high volumes of scans  Generate statistics over extended time periods  Correlate with product data  Provide reports to developers
    10. 10. 10 BACKEND DESIGN GOALS  Scalability  High-volume storage  High-volume throughput  Support large number of concurrent client requests (app)  Availability  Low maintenance
    11. 11. 11 WHICH DATABASE? Apache Cassandra  Large, distributed key-value store (DHT)  «NoSQL»  Inspired by:  Amazon’s Dynamo distributed storage system  Google’s BigTable data model  Originally developed at Facebook  Inbox search
    12. 12. 12 WHY DID WE CHOOSE IT?  Looked very fast  Even when data is much larger than RAM  Performs well in write-heavy environment  Proven scalability  Without downtime  Tunable replication  Easy to run and maintain  No sharding  All nodes are the same - no coordinators, masters, slaves, …  Data model  YMMV…
    13. 13. 13 WHAT YOU HAVE TO GIVE UP  Joins  Referential integrity  Transactions  Expressive query language  Consistency (tunable, but…)  Limited support for:  Schema  Secondary indices
    14. 14. 14 CASSANDRA DATA MODEL  Column families  Rows  Columns  (Supercolumns)  We’ll skip them - Cassandra developers don’t like them Disclaimer: I tend to say «hash» when I mean «dictionary, map, associative array» (Can you tell my favorite language?)
    15. 15. 15 COLUMNS AND ROWS  Column:  Is a name-value pair  Row:  Has exactly one key  Contains any number of columns  Columns are always automatically sorted by their name  Column family:  A collection of any number of rows (!)  Has a name  «Like a table»
    16. 16. 16 EXAMPLE COLUMN FAMILY  A column family «users» containing two rows  Columns can be different in every row  First row has a column named «phone», second row does not  Rows can have many columns  You can add millions of them "users": { "christof": { "email": "christof@scandit.com", "phone": "123-456-7890" } "moritz": { "email": "moritz@scandit.com", "web": "www.example.com" } } Row with key «christof» Two columns, automatically sorted by their names («email», «web»)
    17. 17. 17 DATA IN COLUMN NAMES  Column names can be used to store data  Frequent pattern in Cassandra  Takes advantage of column sorting "logins": { "christof": { "2012-01-29 16:22:30 +0100": "208.115.113.86", "2012-01-30 07:48:03 +0100": "66.249.66.183", "2012-01-30 18:06:55 +0100": "208.115.111.70", "2012-01-31 12:37:26 +0100": "66.249.66.183" } "moritz": { "2012-01-23 01:12:49 +0100": "205.209.190.116" } }
    18. 18. 18 SCHEMA AND DATA TYPES  Schema is optional  Data type can be defined for:  Keys  The values of all columns with a given name  The column names in a CF  By default, data type BLOB is used  Data Types  BLOB (default)  ASCII text  UTF8 text  Timestamp  Boolean  UUID  Integer (arbitrary length)  Float  Double  Decimal
    19. 19. 19 CLUSTER ORGANIZATION Node 3 Token 128 Node 2 Token 64 Node 4 Token 192 Node 1 Token 0 Range 1-64, stored on node 2 Range 65-128, stored on node 3
    20. 20. 20 STORING A ROW 1. Calculate md5 hash for row key Example: md5(“foobar") = 48 2. Determine data range for hash Example: 48 lies within range 1-64 3. Store row on node responsible for range Example: store on node 2 Node 3 Token 128 Node 2 Token 64 Node 4 Token 192 Node 1 Token 0 Range 1-64, stored on node 2 Range 65-128, stored on node 3
    21. 21. 21 IMPLICATIONS  Cluster automatically balanced  Load is shared equally between nodes  No hotspots  Scaling out?  Easy  Divide data ranges by adding more nodes  Cluster rebalances itself automatically  Range queries not possible  You can’t retrieve «all rows from A-C»  Rows are not stored in their «natural» order  Rows are stored in order of their md5 hashes
    22. 22. 22 IF YOU NEED RANGE QUERIES… Option 1: «Order Preserving Partitioner» (OPP)  OPP determines node based on a row’s key instead of its hash  Don’t use it…  Manually balancing a cluster is hard  Hotspots  Balancing cluster for one column family creates hotspot for another Option 2: Use columns instead of rows  Columns are always sorted  Rows can store millions of columns
    23. 23. 23 REPLICATION  Tunable replication factor (RF)  RF > 1: rows are automatically replicated to next RF-1 nodes  Tunable replication strategy  «Ensure two replicas in different data centers, racks, etc.» Node 3 Token 128 Node 2 Token 64 Node 4 Token 192 Node 1 Token 0 Replica 1 of row «foobar» Replica 2 of row «foobar»
    24. 24. 24 CLIENT ACCESS  Clients can send read and write requests to any node  This node will act as coordinator  Coordinator forwards request to nodes where data resides Node 3 Token 128 Node 2 Token 64 Node 4 Token 192 Node 1 Token 0 Client Request: insert( "foobar": { "email": "fb@example.com" } ) Replica 2 of row «foobar» Replica 1 of row «foobar»
    25. 25. 25 CONSISTENCY LEVELS  For all requests, clients can set a consistency level (CL)  For writes:  CL defines how many replicas must be written before «success» is returned to client  For reads:  CL defines how many replicas must respond before result is returned to client  Consistency levels:  ONE  QUORUM  ALL  … (data center-aware levels)
    26. 26. 26 INCONSISTENT DATA  Example scenario:  Replication factor 2  Two existing replica for row «foobar»  Client overwrites existing columns in «foobar»  Replica 2 is down  What happens:  Column is updated in replica 1, but not replica 2 (even with CL=ALL !)  Timestamps to the rescue  Every column has a timestamp  Timestamps are supplied by clients  Upon read, column with latest timestamp wins  →Use NTP
    27. 27. 27 PREVENTING INCONSISTENCIES  Read repair  Hinted handoff  Anti entropy
    28. 28. 28 RETRIEVING DATA (API)  At a row level, you can…  Get all rows  Get a single row by specifying its key  Get a number of rows by specifying their keys  Get a range of rows  Only with OPP, strongly discouraged  At a column level, you can…  Get all columns  Get a single column by specifying its name  Get a number of columns by specifying their names  Get a range of columns by specifying the name of the first and last column  Again: no ranges of rows
    29. 29. 29 CASSANDRA QUERY LANGUAGE (CQL) UPDATE users SET "email" = "christof@scandit.com", "phone" = "123-456-7890" WHERE KEY = "christof"; "users": { "christof": { "email": "christof@scandit.com", "phone": "123-456-7890" } "moritz": { "email": "moritz@scandit.com", "web": "www.example.com" } }
    30. 30. 30 CASSANDRA QUERY LANGUAGE (CQL) SELECT * FROM users WHERE KEY = "christof"; "users": { "christof": { "email": "christof@scandit.com", "phone": "123-456-7890" } "moritz": { "email": "moritz@scandit.com", "web": "www.example.com" } }
    31. 31. 31 CASSANDRA QUERY LANGUAGE (CQL) SELECT "2012-01-30 00:00:00 +0100" .. "2012-01-31 23:59:59 +0100" FROM logins WHERE KEY = "christof"; "logins": { "christof": { "2012-01-29 16:22:30 +0100": "208.115.113.86", "2012-01-30 07:48:03 +0100": "66.249.66.183", "2012-01-30 18:06:55 +0100": "208.115.111.70", "2012-01-31 12:37:26 +0100": "66.249.66.183" } "moritz": { "2012-01-23 01:12:49 +0100": "205.209.190.116" } }
    32. 32. 32 SECONDARY INDICES  Secondary indices can be defined for (single) columns  Secondary indices only support equality predicate (=) in queries  Each node maintains index for data it owns  When indexed column is queried, request must be forwarded to all nodes  Sometimes better to manually maintain your own index
    33. 33. 33 PRODUCTION EXPERIENCE  No stability issues  Very fast  Language bindings don’t have the same quality  Out of sync, bugs  Data model is a mental twist  Design-time decisions sometimes hard to change  Rudimentary access control
    34. 34. 34 TRYING OUT CASSANDRA  DataStax website  Company founded by Cassandra developers  Provides  Documentation  Amazon Machine Image  Apache website  Mailing lists
    35. 35. 35 CLUSTER AT SCANDIT  Several nodes in two data centers  Linux machines  Identical setup on every node  Allows for easy failover
    36. 36. 36 NODE ARCHITECTURE Website & REST API Ruby on Rails, Rack to other nodes frommobileappsandwebbrowsers Phusion Passenger mod_passenger
    37. 37. THANK YOU! www.scandit.com

    ×