Leveraging Big Data and Real-Time Analytics at Cxense

Simon Lia-Jonassen
08/04/15

Our mission is to help companies understand their
audience and build great online user experiences.
–  Stay longer on the site.
–  Find interesting articles.
–  Sign up for subscriptions.
–  Buy recommended products.
About Cxense

Founded in 2010, ~100 employees in 2015.
Offices
–  Melbourne, Tokyo, Singapore, Stockholm, Copenhagen, Oslo*, London,
Buenos Aires, Rio de Janeiro, Miami, New York, San Francisco.
Some of our customers
About Cxense

Data Volume and Traffic
–  5K+ Web-sites
–  50M+ pages (last month)
–  500M+ users (last month)
–  10B+ events/month (20K events/sec peak)
Heterogeneity and Reliability
–  Hundreds of mobile and desktop platforms, browsers, internet providers, etc.
–  Multiple devices per user, cross-domain tracking (3rd party cookie is dying).
–  Web-pages (articles, image/video galleries, chats, search/front pages) and human language.
–  The Internet is Broken™
Constraints and Requirements
–  Online and real-time processing
•  Show and analyze what is happening right now.
–  High and sustainable performance
•  Throughput: peak load of 10K+ requests/sec.
•  Latency: 100ms latency constraint for ads and recs.
–  Fault-tolerance and durability
Challenges
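To put the traffic figures above in perspective, the following back-of-the-envelope sketch compares the average event rate with the stated peak. It assumes a 30-day month; the function and variable names are illustrative only.

```python
# Rough load arithmetic for the figures on the slide.
# Assumption: a "month" is 30 days.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds


def average_rate(events_per_month: float) -> float:
    """Average events per second over a 30-day month."""
    return events_per_month / SECONDS_PER_MONTH


events_per_month = 10e9      # "10B+ events/month"
peak_rate = 20_000           # "20K events/sec peak"

avg_rate = average_rate(events_per_month)
peak_to_average = peak_rate / avg_rate

print(f"average: {avg_rate:.0f} events/sec")
print(f"peak is {peak_to_average:.1f}x the average")
```

Under these assumptions the average is a little under 4K events/sec, so the system has to be provisioned for a peak roughly five times the average load.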

Architecture and Data Flow (simplified)

Communication
–  HTTP with JSON payload.
–  Durable and Idempotent.
Local storage
–  Atomically append to file.
–  Use a new file each hour.
–  Use a separate directory for each partition.
–  Tail files and/or directories.
Metadata
–  Keeps the state.
–  Can go backwards and re-feed when needed.
System
–  Semi-automatic configuration via Upstart and Crontab.
–  Monitoring via Graphite and log files.
–  Automatic alerting and centralized log search.
Data Flow and Feeding
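The local storage scheme above (append-only files, one file per hour, one directory per partition) could be sketched as follows. This is a minimal illustration, not the actual Cxense feeder: the directory layout, file naming and the `append_event` helper are assumptions, and the atomicity relies on POSIX `O_APPEND` semantics with each event written in a single small `write` call.

```python
import json
import os
import time


def event_path(root, partition, timestamp):
    """One directory per partition, one file per UTC hour (hypothetical layout)."""
    hour = time.strftime("%Y-%m-%d-%H", time.gmtime(timestamp))
    return os.path.join(root, f"partition-{partition:03d}", f"{hour}.log")


def append_event(root, partition, event, timestamp=None):
    """Append one JSON event as a single line and return the file path."""
    timestamp = time.time() if timestamp is None else timestamp
    path = event_path(root, partition, timestamp)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    line = (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")
    # O_APPEND positions every write at the current end of file, so
    # concurrent writers do not overwrite each other's lines.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, line)  # one call per event; assumes small lines
    finally:
        os.close(fd)
    return path
```

A consumer can then tail the newest file in each partition directory and record its read offset as metadata, which is what makes going backwards and re-feeding possible.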

What is The Cube?
–  Partitioned column store database.
–  Using efficient string handling and integer compression.
–  Provides fast filtering and aggregation over 50B data points.
–  Guarantees low update latency (100ms).
–  Exists in multiple variants:
•  Disk or memory based.
•  Partitioned by site, by user or by both.
–  Low-level API.
Example:
time           user    rnd     siteid  url                                browser
1409425329634  “4szi”  “xzst”  “9978”  “cxnews.com”                       “Chrome”
1409425329634  “zthp”  “fd0z”  “9978”  “cxnews.com/seahawks-win-again…”   “Firefox”
1409425329635  “4szi”  “tzdt”  “9978”  “cxnews.com/tesla-model-3-will-…”  “Chrome”
1409425329640  “4szi”  “aext”  “9978”  “cxnews.com/elon-musk-is-awes…”    “Chrome”
1409425329640  “zx5t”  “dxrf”  “9978”  “cxnews.com/tesla-model-3-will-…”  “Safari”
The Cube

Frame of Reference Compression
–  Compress the numbers in groups of 64.
–  If the sequence is increasing, use the first number as the reference and store the
differences between consecutive numbers (deltas).
–  Find the number of bits (width) needed to represent the largest delta and
compress the deltas using that fixed bit width.
–  For non-increasing sequences, use the smallest number as the reference and store the
differences between each number and the reference as deltas.
The Cube – Integer Columns
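The frame-of-reference scheme described on this slide can be sketched as below. This is an illustrative reading of the bullets, not the actual Cube code: it treats "increasing" as non-decreasing, packs the deltas into a Python integer rather than a byte buffer, and all function names are hypothetical.

```python
GROUP_SIZE = 64  # numbers are compressed in groups of 64


def encode_group(values):
    """Encode one non-empty group of non-negative integers."""
    increasing = all(a <= b for a, b in zip(values, values[1:]))
    if increasing:
        reference = values[0]
        deltas = [b - a for a, b in zip(values, values[1:])]
    else:
        reference = min(values)
        deltas = [v - reference for v in values]
    # Fixed bit width: enough bits for the largest delta.
    width = max((d.bit_length() for d in deltas), default=0)
    packed = 0
    for i, delta in enumerate(deltas):
        packed |= delta << (i * width)
    return {"increasing": increasing, "reference": reference,
            "width": width, "count": len(values), "packed": packed}


def decode_group(group):
    """Invert encode_group."""
    width, packed = group["width"], group["packed"]
    mask = (1 << width) - 1
    n_deltas = group["count"] - 1 if group["increasing"] else group["count"]
    deltas = [(packed >> (i * width)) & mask for i in range(n_deltas)]
    if group["increasing"]:
        values = [group["reference"]]
        for delta in deltas:
            values.append(values[-1] + delta)
        return values
    return [group["reference"] + delta for delta in deltas]


def compress(values):
    return [encode_group(values[i:i + GROUP_SIZE])
            for i in range(0, len(values), GROUP_SIZE)]


def decompress(groups):
    return [v for group in groups for v in decode_group(group)]


# Timestamps from the example table: sorted, so consecutive deltas are used.
timestamps = [1409425329634, 1409425329634, 1409425329635,
              1409425329640, 1409425329640]
time_groups = compress(timestamps)

# An unsorted column: deltas are taken against the smallest value.
unsorted = [7, 3, 9, 3]
unsorted_groups = compress(unsorted)
```

For the five timestamps the deltas are 0, 1, 5 and 0, so each one fits in 3 bits instead of the 41 bits needed for a raw millisecond timestamp.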

–  A global lexicon maps all strings to numbers and back.
–  For each column, we map global keys to a smaller set of numbers and back.
The Cube – String Columns
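A minimal sketch of the two-level string mapping follows: a shared lexicon assigns a global key to every string, and each column maps the global keys it actually uses to a dense local range. The class names `Lexicon` and `StringColumn` are assumptions made for illustration.

```python
class Lexicon:
    """Global mapping: string <-> global key."""

    def __init__(self):
        self._keys = {}
        self._strings = []

    def key(self, string):
        key = self._keys.get(string)
        if key is None:
            key = len(self._strings)
            self._keys[string] = key
            self._strings.append(string)
        return key

    def string(self, key):
        return self._strings[key]


class StringColumn:
    """Per-column mapping: global key <-> small local number."""

    def __init__(self, lexicon):
        self.lexicon = lexicon
        self._local = {}    # global key -> local number
        self._global = []   # local number -> global key
        self.rows = []      # one local number per row

    def append(self, string):
        global_key = self.lexicon.key(string)
        local = self._local.get(global_key)
        if local is None:
            local = len(self._global)
            self._local[global_key] = local
            self._global.append(global_key)
        self.rows.append(local)

    def get(self, row):
        return self.lexicon.string(self._global[self.rows[row]])


lexicon = Lexicon()
user = StringColumn(lexicon)
browser = StringColumn(lexicon)
for u, b in [("4szi", "Chrome"), ("zthp", "Firefox"), ("4szi", "Chrome"),
             ("4szi", "Chrome"), ("zx5t", "Safari")]:
    user.append(u)
    browser.append(b)
```

The local numbers stay small even when the global lexicon is large, so the row values of a low-cardinality column such as browser compress well with the integer scheme used for numeric columns.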

Filter
–  Keep a bit-filter over a particular range of rows as a state.
Filtering
–  By number or range – pass through a column and update the filter.
•  Use binary search for ordered columns such as time, and an inverted index for user id.
–  By key – map the key to a number and filter by the number.
–  By set of keys – map the keys to a bit-set and filter using the bit-set.
–  By pattern – filter by the set of keys matching the pattern.
Logical operations
–  AND, OR, NOT – use unary negation, binary intersection/join and a stack of filters.
Advanced operations
–  Use aggregation output as filtering input (e.g., top-list, explosion, histogram, etc.).
–  Join between different cubes on one or multiple dimensions.
The Cube – Filtering
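The filtering model can be illustrated with a bit-filter held in a Python integer (bit i set means row i passes) and a stack for the logical operations. This is a simplified sketch under assumptions: the helper names are hypothetical, columns are plain lists, and OR stands in for what the slide calls join.

```python
from bisect import bisect_left, bisect_right


def filter_equal(column, value):
    """Pass through a column and set the bit of every matching row."""
    bits = 0
    for row, v in enumerate(column):
        if v == value:
            bits |= 1 << row
    return bits


def filter_sorted_range(column, low, high):
    """Range filter on an ordered column using binary search."""
    start = bisect_left(column, low)
    end = bisect_right(column, high)
    return ((1 << (end - start)) - 1) << start


class FilterStack:
    """Stack of bit-filters with AND, OR and NOT."""

    def __init__(self, n_rows):
        self.all_rows = (1 << n_rows) - 1
        self.stack = []

    def push(self, bits):
        self.stack.append(bits)

    def and_(self):
        right, left = self.stack.pop(), self.stack.pop()
        self.stack.append(left & right)

    def or_(self):
        right, left = self.stack.pop(), self.stack.pop()
        self.stack.append(left | right)

    def not_(self):
        self.stack.append(~self.stack.pop() & self.all_rows)


# Columns from the example table.
time_col = [1409425329634, 1409425329634, 1409425329635,
            1409425329640, 1409425329640]
browser_col = ["Chrome", "Firefox", "Chrome", "Chrome", "Safari"]

# Query: time in [..635, ..640] AND NOT browser == "Chrome"
filters = FilterStack(len(time_col))
filters.push(filter_sorted_range(time_col, 1409425329635, 1409425329640))
filters.push(filter_equal(browser_col, "Chrome"))
filters.not_()
filters.and_()
result = filters.stack[-1]
matching_rows = [row for row in range(len(time_col)) if result >> row & 1]
```

Only the last row (the Safari visit) falls in the time range and is not a Chrome visit, so it is the only bit left set.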

Operations
–  Count – count the number of bits in the filter.
–  Sum – sum the numbers where filter bit is set.
–  Cardinality – count the number of distinct keys/numbers.
–  CardinalityEstimator – create a HyperLogLog cardinality estimator.
–  Frequency – create a map of keys/numbers with the associated count.
–  TopList – create a frequency map with only the k most popular keys/numbers.
–  SumBy – create a map of keys/numbers with the associated sum.
–  CardinalityMap – create a map of keys/numbers with the associated cardinality.
–  FrequencyDistribution – create a histogram over frequencies.
–  CardinalityDistribution – create a histogram over cardinalities.
–  SumByDistribution – create a histogram over sums.
–  NumericalStatistics – compute distribution statistics for numbers (min, max, percentiles).
The Cube – Aggregation
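A few of the aggregation operations can be sketched over a bit-filter as follows. The function names mirror the slide but the code is only an illustration: columns are plain lists, cardinality is computed exactly rather than with HyperLogLog, and the `active_time` values are made up for the example.

```python
from collections import Counter


def selected_rows(bits):
    """Yield the row numbers whose filter bit is set."""
    row = 0
    while bits:
        if bits & 1:
            yield row
        bits >>= 1
        row += 1


def count(bits):
    return bin(bits).count("1")


def total(bits, numbers):
    return sum(numbers[row] for row in selected_rows(bits))


def cardinality(bits, column):
    return len({column[row] for row in selected_rows(bits)})


def frequency(bits, column):
    return Counter(column[row] for row in selected_rows(bits))


def top_list(bits, column, k):
    return frequency(bits, column).most_common(k)


def sum_by(bits, keys, numbers):
    sums = {}
    for row in selected_rows(bits):
        sums[keys[row]] = sums.get(keys[row], 0) + numbers[row]
    return sums


browser_col = ["Chrome", "Firefox", "Chrome", "Chrome", "Safari"]
active_time = [10, 20, 30, 40, 50]   # hypothetical values, in seconds

all_rows = 0b11111
last_three = 0b11100
```

For example, `top_list(all_rows, browser_col, 1)` returns the single most frequent browser, and `sum_by(all_rows, browser_col, active_time)` gives the total active time per browser.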

Partitioning
–  Most of the data structures are partitioned into chunks of data in order to improve memory
allocation, materialization, skipping, compression and locking.
Static and dynamic parts
–  Each data column, lexicon or mapping consists of a static and a dynamic part.
–  The static part is ordered – can use binary search and Minimal Perfect Hashing.
–  The dynamic part is read-write – it has to be searched exhaustively, but this is improved using Wavelet Trees.
Locking
–  Distinct Read and Read-Write Locks with different granularity/scope.
–  The updates are mostly appends, but some of the columns might be updated later (e.g.,
active time, exit query, etc.).
Maintenance
–  Periodically flush the dynamic part into the static part.
–  Remove the old data, delete unused strings, optimize the mapping.
The Cube – Updates
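The split into a static and a dynamic part can be illustrated with a small key set. This sketch is a deliberate simplification: it uses plain binary search for the static part and a linear scan for the dynamic part, in place of the Minimal Perfect Hashing and Wavelet Trees mentioned on the slide, and the class name `TwoPartKeySet` is hypothetical.

```python
from bisect import bisect_left


class TwoPartKeySet:
    """Key set with an ordered static part and an append-only dynamic part."""

    def __init__(self):
        self.static = []    # sorted, rebuilt only on flush
        self.dynamic = []   # unsorted, receives all updates

    def add(self, key):
        self.dynamic.append(key)

    def contains(self, key):
        # Static part: binary search.
        i = bisect_left(self.static, key)
        if i < len(self.static) and self.static[i] == key:
            return True
        # Dynamic part: exhaustive search.
        return key in self.dynamic

    def flush(self):
        """Periodic maintenance: merge the dynamic part into the static part."""
        self.static = sorted(set(self.static) | set(self.dynamic))
        self.dynamic = []


keys = TwoPartKeySet()
for user_id in ["zthp", "4szi", "zx5t", "4szi"]:
    keys.add(user_id)
found_before_flush = keys.contains("zx5t")
keys.flush()
keys.add("aaaa")
```

Lookups give the same answer before and after a flush; the flush only moves keys from the slow, writable part into the fast, read-only part.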

Keyword vectors
–  Represent user and document profiles.
–  Each contains a document id, a version and a set of weighted group-item pairs.
–  Stored in a separate, highly partitioned set of containers.
–  Each container keeps multiple groups.
–  Each group contains document ids, items and weights as columns.
The Cube – Advanced Data Types

Structured data
–  Can represent any simple JSON object (document).
–  Node types: Null, Object, Array, Integer, Float, String, Boolean.
–  Stored in a separate container, separate columns for each node type.
–  Each document is decomposed into a list of paths and nodes.
–  Each node is added to the corresponding column.
The Cube – Advanced Data Types
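The decomposition of a JSON document into paths and typed nodes might look like the sketch below. The representation is an assumption made for illustration: paths are tuples of keys and array indices, Object and Array nodes store their size, and each node type gets its own list acting as a column.

```python
from collections import defaultdict


def decompose(node, path=()):
    """Yield (path, node_type, value) for every node in a JSON document."""
    if node is None:
        yield path, "Null", None
    elif isinstance(node, bool):          # must be tested before int
        yield path, "Boolean", node
    elif isinstance(node, int):
        yield path, "Integer", node
    elif isinstance(node, float):
        yield path, "Float", node
    elif isinstance(node, str):
        yield path, "String", node
    elif isinstance(node, list):
        yield path, "Array", len(node)
        for index, item in enumerate(node):
            yield from decompose(item, path + (index,))
    elif isinstance(node, dict):
        yield path, "Object", len(node)
        for key, value in node.items():
            yield from decompose(value, path + (key,))
    else:
        raise TypeError(f"not a JSON value: {node!r}")


def to_columns(document):
    """Add each node to the column of its node type."""
    columns = defaultdict(list)
    for path, node_type, value in decompose(document):
        columns[node_type].append((path, value))
    return dict(columns)


document = {"title": "Tesla", "tags": ["cars", "ev"], "views": 3,
            "score": 0.5, "live": True, "author": None}
columns = to_columns(document)
```

Here `columns["String"]` holds the title and both tags with their paths, while the single Integer, Float, Boolean and Null nodes each land in their own column.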

Analytics API
–  RESTful API – client-server, HTTP requests and response codes, stateless, cacheable, etc.
–  API resource paths, JSON in - JSON out.
–  Most of the APIs require authentication.
–  Simple integration via cx.py, Java/JavaScript/C#/Python/Perl/PHP or HTTP calls directly.
Traffic API
–  A rich set of high-level APIs.
–  Powerful ad-hoc syntax – types, groups, items, filters, fields, etc.
–  See the demo!
Analytics UI
–  HTML and JavaScript.
–  Is built on top of the Analytics API.
–  Has multiple fixed, functional views which can be combined with arbitrary filters.
–  Premium users have a workspace area for dynamic, configurable widgets.
Analytics API and UI

Thank you!
Questions?
Credits: Erik Gorset & Oslo Dev Team

…btw, we are hiring!
www.cxense.com
https://twitter.com/cxense
www.facebook.com/cxense
www.linkedin.com/company/cxense
Connect with Cxense
simon.jonassen@cxense.com
