Leveraging Big Data and Real-Time
Analytics at Cxense
Simon Lia-Jonassen
08/04/15
Our mission is to help companies understand their
audience and build great online user experiences.
– Stay longer on the site.
– Find interesting articles.
– Sign up for subscriptions.
– Buy recommended products.
About Cxense
Founded in 2010, ~100 employees in 2015.
Offices
–  Melbourne, Tokyo, Singapore, Stockholm, Copenhagen, Oslo*, London,
Buenos Aires, Rio de Janeiro, Miami, New York, San Francisco.
Some of our customers
About Cxense
Our solutions
How does it work!?
Event (example)
Content Profile (example)
Data Volume and Traffic
–  5K+ Web-sites
–  50M+ pages (last month)
–  500M+ users (last month)
–  10B+ events/month (20K events/sec peak)
Heterogeneity and Reliability
–  Hundreds of mobile and desktop platforms, browsers, internet providers, etc.
–  Multiple devices per user, cross-domain tracking (3rd party cookie is dying).
–  Web-pages (articles, image/video galleries, chats, search/front pages) and human language.
–  The Internet is Broken™
Constraints and Requirements
–  Online and real-time processing
•  Show and analyze what is happening right now.
–  High and sustainable performance
•  Throughput: peak load of 10K+ requests/sec.
•  Latency: 100ms latency constraint for ads and recs.
–  Fault-tolerance and durability
Challenges
Architecture and Data Flow (simplified)
Communication
–  HTTP with JSON payload.
–  Durable and Idempotent.
Local storage
–  Atomically append to file.
–  Use a new file each hour.
–  Use a separate directory for each partition.
–  Tail files and/or directories.
Metadata
–  Keeps the state.
–  Can go backwards and re-feed when needed.
System
–  Semi-automatic configuration via Upstart and Crontab.
–  Monitoring via Graphite and log files.
–  Automatic alerting and centralized log search.
Data Flow and Feeding
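The local-storage rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the directory layout, the file naming, and the helpers `event_path`, `append_event` and `read_from` are hypothetical and not taken from the Cxense code base. A single `O_APPEND` write per line is used to approximate an atomic append, and a byte offset plays the role of the metadata that lets a reader go backwards and re-feed.

```python
import json
import os
from datetime import datetime, timezone


def event_path(root, partition, ts_ms):
    """One directory per partition, one file per UTC hour (assumed layout)."""
    hour = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).strftime("%Y%m%d-%H")
    return os.path.join(root, f"partition-{partition:03d}", f"events-{hour}.log")


def append_event(root, partition, event):
    """Append one JSON event as one line, using a single O_APPEND write."""
    path = event_path(root, partition, event["time"])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    line = (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, line)
    finally:
        os.close(fd)
    return path


def read_from(path, offset):
    """Tail a file: return the complete lines after `offset` and the new offset.

    The offset is the only state the reader keeps; storing an older offset
    again makes the reader re-feed the same events.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    end = data.rfind(b"\n") + 1  # ignore a trailing, partially written line
    events = [json.loads(line) for line in data[:end].splitlines()]
    return events, offset + end
```

Because re-feeding can deliver the same event twice, the receiving side has to be idempotent, which matches the "Durable and Idempotent" requirement on the communication layer.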
What is The Cube?
–  Partitioned column store database.
–  Using efficient string handling and integer compression.
–  Provides fast filtering and aggregation over 50B data points.
–  Guarantees low update latency (100ms).
–  Exists in multiple variants:
•  Disk or memory based.
•  Partitioned by site, by user or by both.
–  Low-level API.
Example:
The Cube
© imdb.com
time           user    rnd     siteid  url                                browser
1409425329634  “4szi”  “xzst”  “9978”  “cxnews.com”                       “Chrome”
1409425329634  “zthp”  “fd0z”  “9978”  “cxnews.com/seahawks-win-again…”   “Firefox”
1409425329635  “4szi”  “tzdt”  “9978”  “cxnews.com/tesla-model-3-will-…”  “Chrome”
1409425329640  “4szi”  “aext”  “9978”  “cxnews.com/elon-musk-is-awes…”    “Chrome”
1409425329640  “zx5t”  “dxrf”  “9978”  “cxnews.com/tesla-model-3-will-…”  “Safari”
Frame of Reference Compression
–  Compress the numbers in groups of 64.
–  If the sequence is increasing – use the first number as the reference and compute the
differences between consecutive numbers (deltas).
–  Find the maximum number of bits (width) needed to represent the largest delta and
compress the deltas using a fixed bit width.
–  For non-increasing sequences, use the smallest number as the reference and the
differences between the numbers and the reference as deltas.
The Cube – Integer Columns
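A minimal Python sketch of the frame-of-reference scheme described above. The function names and the tuple layout of an encoded group are assumptions made for illustration; the actual on-disk format of The Cube is not given in the slides. "Increasing" is interpreted here as non-decreasing, so repeated timestamps still take the delta path.

```python
GROUP_SIZE = 64  # the slides compress numbers in groups of 64


def for_encode(block):
    """Encode one group as (increasing, reference, width, count, packed)."""
    assert 0 < len(block) <= GROUP_SIZE
    increasing = all(a <= b for a, b in zip(block, block[1:]))
    if increasing:
        # Reference is the first number; deltas are consecutive differences.
        reference = block[0]
        deltas = [b - a for a, b in zip(block, block[1:])]
    else:
        # Reference is the smallest number; deltas are offsets from it.
        reference = min(block)
        deltas = [v - reference for v in block]
    # Fixed bit width: enough bits for the largest delta.
    width = max((d.bit_length() for d in deltas), default=0)
    packed = 0
    for i, d in enumerate(deltas):
        packed |= d << (i * width)
    return increasing, reference, width, len(block), packed


def for_decode(encoded):
    """Invert for_encode and return the original list of numbers."""
    increasing, reference, width, count, packed = encoded
    mask = (1 << width) - 1
    n_deltas = count - 1 if increasing else count
    deltas = [(packed >> (i * width)) & mask for i in range(n_deltas)]
    if not increasing:
        return [reference + d for d in deltas]
    values = [reference]
    for d in deltas:
        values.append(values[-1] + d)
    return values


def encode_column(values):
    """Split a column into groups of 64 and encode each group independently."""
    return [for_encode(values[i:i + GROUP_SIZE])
            for i in range(0, len(values), GROUP_SIZE)]
```

For the five timestamps in the example table the deltas are 0, 1, 5 and 0, so each delta fits in 3 bits instead of the 41 bits needed for a raw millisecond timestamp.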
–  A global lexicon maps all strings to numbers and back.
–  For each column, we map global keys to a smaller set of numbers and back.
The Cube – String Columns
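The two-level mapping above could look roughly like the following Python sketch. The class names `Lexicon` and `StringColumn` and their methods are hypothetical. The point is only that each column stores small local numbers, while the global lexicon stores each distinct string once.

```python
class Lexicon:
    """Global mapping from strings to numbers and back."""

    def __init__(self):
        self._ids = {}       # string -> global key
        self._strings = []   # global key -> string

    def encode(self, s):
        if s not in self._ids:
            self._ids[s] = len(self._strings)
            self._strings.append(s)
        return self._ids[s]

    def decode(self, key):
        return self._strings[key]


class StringColumn:
    """Per-column mapping from global keys to a smaller, dense set of local keys."""

    def __init__(self, lexicon):
        self.lexicon = lexicon
        self._local = {}     # global key -> local key
        self._globals = []   # local key -> global key
        self.rows = []       # one local key per row

    def append(self, s):
        g = self.lexicon.encode(s)
        if g not in self._local:
            self._local[g] = len(self._globals)
            self._globals.append(g)
        self.rows.append(self._local[g])

    def get(self, row):
        return self.lexicon.decode(self._globals[self.rows[row]])
```

Since the local keys of a column are small integers, the `rows` list is itself an integer column and can be compressed with a fixed bit width.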
Filter
–  Keep a bit-filter over a particular range of rows as a state.
Filtering
–  By number or range – pass through a column and update the filter.
Use binary search for ordered columns such as time, inverted index for user id.
–  By key – map the key to a number and filter by the number.
–  By set of keys – map the keys to a bit-set and filter using the bit-set.
–  By pattern – filter by the set of keys matching the pattern.
Logical operations
–  AND, OR, NOT – use unary negation, binary intersection/join and a stack of filters.
Advanced operations
–  Use aggregation output as filtering input (e.g., top-list, explosion, histogram, etc.).
–  Join between different cubes on one or multiple dimensions.
The Cube – Filtering
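The filtering model above can be illustrated with Python integers used as bit-filters (bit i set means row i passes). All function names are assumptions, and for brevity the sketch compares raw values instead of first mapping keys to numbers as the slides describe. Logical operations are evaluated from a postfix expression with a stack of filters.

```python
from bisect import bisect_left, bisect_right


def filter_range_sorted(column, lo, hi):
    """Rows with lo <= value <= hi in an ordered column, found by binary search."""
    start, end = bisect_left(column, lo), bisect_right(column, hi)
    return ((1 << (end - start)) - 1) << start


def filter_equals(column, value):
    """Pass through an unordered column and set the bit of every matching row."""
    bits = 0
    for row, v in enumerate(column):
        if v == value:
            bits |= 1 << row
    return bits


def filter_in(column, values):
    """Filter by a set of keys (a Python set stands in for the key bit-set)."""
    wanted = set(values)
    bits = 0
    for row, v in enumerate(column):
        if v in wanted:
            bits |= 1 << row
    return bits


def evaluate(postfix, n_rows):
    """Evaluate filters combined with 'AND', 'OR' and 'NOT' using a stack."""
    all_rows = (1 << n_rows) - 1
    stack = []
    for item in postfix:
        if item == "NOT":
            stack.append(all_rows & ~stack.pop())
        elif item == "AND":
            b, a = stack.pop(), stack.pop()
            stack.append(a & b)
        elif item == "OR":
            b, a = stack.pop(), stack.pop()
            stack.append(a | b)
        else:
            stack.append(item)  # a bit-filter operand
    return stack.pop()
```

For example, `evaluate([time_filter, chrome_filter, "AND"], n_rows)` keeps only the rows that fall in the time range and were produced by Chrome.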
Operations
–  Count – count the number of bits in the filter.
–  Sum – sum the numbers where filter bit is set.
–  Cardinality – count the number of distinct keys/numbers.
–  CardinalityEstimator – create a HyperLogLog cardinality estimator.
–  Frequency – create a map of keys/numbers with the associated count.
–  TopList – create a frequency map with only the k most popular keys/numbers.
–  SumBy – create a map of keys/numbers with the associated sum.
–  CardinalityMap – create a map of keys/numbers with the associated cardinality.
–  FrequencyDistribution – create a histogram over frequencies.
–  CardinalityDistribution – create a histogram over cardinalities.
–  SumByDistribution – create a histogram over sums.
–  NumericalStatistics – compute distribution statistics for numbers (min, max, percentiles).
The Cube – Aggregation
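A few of these operations, sketched in Python over plain lists and an integer bit-filter (bit i set means row i is included). The function names mirror the slide but the signatures are assumptions; the estimator and distribution variants are left out.

```python
from collections import Counter, defaultdict


def selected_rows(bits):
    """Yield the row numbers whose bit is set in the filter."""
    row = 0
    while bits:
        if bits & 1:
            yield row
        bits >>= 1
        row += 1


def count(bits):
    """Count: the number of bits set in the filter."""
    return bin(bits).count("1")


def total(column, bits):
    """Sum: add up the numbers where the filter bit is set."""
    return sum(column[r] for r in selected_rows(bits))


def cardinality(column, bits):
    """Cardinality: the exact number of distinct keys/numbers."""
    return len({column[r] for r in selected_rows(bits)})


def frequency(column, bits):
    """Frequency: map each key/number to its count."""
    return Counter(column[r] for r in selected_rows(bits))


def top_list(column, bits, k):
    """TopList: the k most frequent keys/numbers with their counts."""
    return frequency(column, bits).most_common(k)


def sum_by(keys, values, bits):
    """SumBy: map each key to the sum of a numeric column."""
    sums = defaultdict(int)
    for r in selected_rows(bits):
        sums[keys[r]] += values[r]
    return dict(sums)
```

Every operation takes the filter as input, so any filter expression can be combined with any aggregation.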
Partitioning
–  Most of the data structures are partitioned into chunks of data in order to improve memory
allocation, materialization, skipping, compression and locking.
Static and dynamic parts
–  Each data column, lexicon or mapping consists of a static and a dynamic part.
–  The static part is ordered – can use binary search and Minimal Perfect Hashing.
–  The dynamic part is read-write – has to be searched exhaustively, but this is improved using Wavelet Trees.
Locking
–  Distinct Read and Read-Write Locks with different granularity/scope.
–  The updates are mostly appends, but some of the columns might be updated later (e.g.,
active time, exit query, etc.).
Maintenance
–  Periodically flush the dynamic part into the static part.
–  Remove the old data, delete unused strings, optimize the mapping.
The Cube – Updates
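The split into a static and a dynamic part can be illustrated with a small Python sketch. `TwoPartKeySet` is a hypothetical name, and the sketch deliberately uses only a sorted list with binary search and a plain list with a linear scan; the Minimal Perfect Hashing, Wavelet Trees and locking mentioned above are not modeled.

```python
from bisect import bisect_left


class TwoPartKeySet:
    """A key set with an ordered static part and an append-only dynamic part."""

    def __init__(self):
        self.static = []    # sorted; rebuilt only during flush()
        self.dynamic = []   # unsorted; receives all new keys

    def contains(self, key):
        # Static part: binary search over the ordered keys.
        i = bisect_left(self.static, key)
        if i < len(self.static) and self.static[i] == key:
            return True
        # Dynamic part: exhaustive search.
        return key in self.dynamic

    def add(self, key):
        if not self.contains(key):
            self.dynamic.append(key)

    def flush(self):
        """Maintenance step: merge the dynamic part into the static part."""
        self.static = sorted(set(self.static) | set(self.dynamic))
        self.dynamic = []
```

Periodic flushing keeps the dynamic part short, so the exhaustive search stays cheap while most lookups hit the ordered static part.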
Keyword vectors
–  Represent user and document profiles.
–  Each contains a document id, a version and a set of group-item pairs with a weight.
–  Stored in a separate, highly partitioned set of containers.
–  Each container keeps multiple groups.
–  Each group contains document ids, items and weights as columns.
The Cube – Advanced Data Types
Structured data
–  Can represent any simple JSON object (document).
–  Node types: Null, Object, Array, Integer, Float, String, Boolean.
–  Stored in a separate container, separate columns for each node type.
–  Each document is decomposed into a list of paths and nodes.
–  Each node is added to the corresponding column.
The Cube – Advanced Data Types
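One way to picture the decomposition of a JSON document into paths and nodes is the following Python sketch. The tuple layout `(doc_id, path, value)`, the use of a dict of lists as "columns", and the choice to store the child count for Object and Array nodes are all assumptions made for illustration.

```python
def decompose(node, path=()):
    """Yield (path, node_type, value) for every node of a JSON-like document."""
    if node is None:
        yield path, "Null", None
    elif isinstance(node, bool):  # must come before int: bool is a subclass of int
        yield path, "Boolean", node
    elif isinstance(node, int):
        yield path, "Integer", node
    elif isinstance(node, float):
        yield path, "Float", node
    elif isinstance(node, str):
        yield path, "String", node
    elif isinstance(node, list):
        yield path, "Array", len(node)
        for index, child in enumerate(node):
            yield from decompose(child, path + (index,))
    elif isinstance(node, dict):
        yield path, "Object", len(node)
        for key, child in node.items():
            yield from decompose(child, path + (key,))
    else:
        raise TypeError(f"not a JSON value: {type(node).__name__}")


def store(doc_id, document, columns):
    """Add each node to the column that corresponds to its node type."""
    for path, node_type, value in decompose(document):
        columns.setdefault(node_type, []).append((doc_id, path, value))
```

Keeping one column per node type means every column holds values of a single type, which fits the integer and string column encodings described earlier.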
Analytics API
–  RESTful API – client-server, HTTP requests and response codes, stateless, cacheable, etc.
–  API resource paths, JSON in - JSON out.
–  Most of the APIs require authentication.
–  Simple integration via cx.py, Java/JavaScript/C#/Python/Perl/PHP or HTTP calls directly.
Traffic API
–  A rich set of high-level APIs.
–  Powerful ad-hoc syntax – types, groups, items, filters, fields, etc.
–  See the demo!
Analytics UI
–  HTML and JavaScript.
–  Is built on top of the Analytics API.
–  Has multiple fixed, functional views which can be combined with arbitrary filters.
–  Premium users have a workspace area for dynamic, configurable widgets.
Analytics API and UI
Demo Session
Thank you!
Questions?
Credits: Erik Gorset & Oslo Dev Team
…btw, we are hiring!
www.cxense.com
https://twitter.com/cxense
www.facebook.com/cxense
www.linkedin.com/company/cxense
Connect with Cxense
simon.jonassen@cxense.com
©http://www.perspectivaconica.com/

Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 

Leveraging Big Data and Real-Time Analytics at Cxense

  • 1. Leveraging Big Data and Real-Time Analytics at Cxense Simon Lia-Jonassen 08/04/15
  • 2. About Cxense
      Our mission is to help companies understand their audience and build great online user experiences.
        – Stay longer on the site.
        – Sign up for subscriptions.
        – Find interesting articles.
        – Buy recommended products.
  • 3. About Cxense
      Founded in 2010, ~100 employees in 2015.
      Offices
        – Melbourne, Tokyo, Singapore, Stockholm, Copenhagen, Oslo*, London, Buenos Aires, Rio de Janeiro, Miami, New York, San Francisco.
      Some of our customers
  • 5. How does it work!?
  • 9. Challenges
      Data Volume and Traffic
        – 5K+ web sites
        – 50M+ pages (last month)
        – 500M+ users (last month)
        – 10B+ events/month (20K events/sec peak)
      Heterogeneity and Reliability
        – Hundreds of mobile and desktop platforms, browsers, internet providers, etc.
        – Multiple devices per user, cross-domain tracking (the 3rd-party cookie is dying).
        – Web pages (articles, image/video galleries, chats, search/front pages) and human language.
        – The Internet is Broken™
      Constraints and Requirements
        – Online and real-time processing
          • Show and analyze what is happening right now.
        – High and sustainable performance
          • Throughput: peak load of 10K+ requests/sec.
          • Latency: 100 ms latency constraint for ads and recs.
        – Fault tolerance and durability
  • 10. Architecture and Data Flow (simplified)
  • 11. Data Flow and Feeding
      Communication
        – HTTP with JSON payload.
        – Durable and idempotent.
      Local storage
        – Atomically append to file.
        – Use a new file each hour.
        – Use a separate directory for each partition.
        – Tail files and/or directories.
      Metadata
        – Keeps the state.
        – Can go backwards and re-feed when needed.
      System
        – Semi-automatic configuration via Upstart and Crontab.
        – Monitoring via Graphite and log files.
        – Automatic alerting and centralized log search.
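The local-storage bullets on this slide (append atomically, one file per hour, one directory per partition) can be sketched roughly as follows. This is only an illustration, not Cxense's feeder: the directory and file naming, the use of UTC hours, and the function names `event_path` and `append_event` are all assumptions made for the example.

```python
import json
import os
from datetime import datetime, timezone


def event_path(root, partition, ts_ms):
    """Hypothetical layout: one directory per partition, one file per UTC hour."""
    hour = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).strftime("%Y%m%d-%H")
    return os.path.join(root, f"partition-{partition:03d}", f"events-{hour}.log")


def append_event(root, partition, event):
    """Append one JSON event as a single line and return the file path."""
    path = event_path(root, partition, event["time"])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    line = (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")
    # O_APPEND positions every write at the current end of the file.
    # Issuing one write() per line is the usual way to keep concurrent
    # appenders from interleaving partial lines (an assumption about the
    # underlying filesystem, not a guarantee made by the slides).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, line)
        os.fsync(fd)  # push the line to disk before acknowledging (durability)
    finally:
        os.close(fd)
    return path
```

A consumer can then tail the newest file in each partition directory, and re-feeding amounts to re-reading older hourly files from a stored offset.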
  • 12. The Cube
      What is The Cube?
        – Partitioned column store database.
        – Using efficient string handling and integer compression.
        – Provides fast filtering and aggregation over 50B data points.
        – Guarantees low update latency (100 ms).
        – Exists in multiple variants:
          • Disk or memory based.
          • Partitioned by site, by user or by both.
        – Low-level API.
      Example:
        time           user    rnd     siteid  url                                browser
        1409425329634  “4szi”  “xzst”  “9978”  “cxnews.com”                       “Chrome”
        1409425329634  “zthp”  “fd0z”  “9978”  “cxnews.com/seahawks-win-again…”   “Firefox”
        1409425329635  “4szi”  “tzdt”  “9978”  “cxnews.com/tesla-model-3-will-…”  “Chrome”
        1409425329640  “4szi”  “aext”  “9978”  “cxnews.com/elon-musk-is-awes…”    “Chrome”
        1409425329640  “zx5t”  “dxrf”  “9978”  “cxnews.com/tesla-model-3-will-…”  “Safari”
  • 13. The Cube – Integer Columns
      Frame of Reference Compression
        – Compress the numbers in groups of 64.
        – If the sequence is increasing, use the first number as the reference and compute the differences between each pair of consecutive numbers (deltas).
        – Find the maximum number of bits (width) needed to represent the largest delta and compress the deltas using a fixed bit width.
        – For non-increasing sequences, use the smallest number as the reference and the differences between the numbers and the reference as deltas.
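The frame-of-reference scheme on this slide can be sketched in a few lines. This is a minimal illustration, not the Cube's actual encoder: it treats "increasing" as non-decreasing, packs the deltas into a single Python integer instead of a byte buffer, and the names `for_encode` and `for_decode` are made up for the example.

```python
def for_encode(block):
    """Encode up to 64 integers as (increasing, reference, width, count, packed)."""
    assert 0 < len(block) <= 64
    increasing = all(a <= b for a, b in zip(block, block[1:]))
    if increasing:
        # Reference is the first number; deltas are consecutive differences.
        reference = block[0]
        deltas = [b - a for a, b in zip(block, block[1:])]
    else:
        # Reference is the smallest number; deltas are offsets from it.
        reference = min(block)
        deltas = [v - reference for v in block]
    # Fixed bit width: enough bits for the largest delta.
    width = max((d.bit_length() for d in deltas), default=0)
    packed = 0
    for i, d in enumerate(deltas):
        packed |= d << (i * width)
    return increasing, reference, width, len(deltas), packed


def for_decode(increasing, reference, width, count, packed):
    """Invert for_encode."""
    mask = (1 << width) - 1
    deltas = [(packed >> (i * width)) & mask for i in range(count)]
    if increasing:
        out = [reference]
        for d in deltas:
            out.append(out[-1] + d)
        return out
    return [reference + d for d in deltas]
```

For the five timestamps in the example table on the previous slide, the deltas are 0, 1, 5, 0, so each one fits in 3 bits instead of a full 64-bit integer.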
  • 14. The Cube – String Columns
        – A global lexicon maps all strings to numbers and back.
        – For each column, we map global keys to a smaller set of numbers and back.
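The two-level mapping for string columns might look like the sketch below: a shared lexicon assigns a global key to every string, and each column remaps the global keys it actually uses to a dense local range, so its rows need fewer bits. The class and method names (`Lexicon`, `StringColumn`, `encode`, `decode`, `append`, `get`) are assumptions for illustration only.

```python
class Lexicon:
    """Global string <-> number mapping shared by all columns."""

    def __init__(self):
        self._ids = {}
        self._strings = []

    def encode(self, s):
        if s not in self._ids:
            self._ids[s] = len(self._strings)
            self._strings.append(s)
        return self._ids[s]

    def decode(self, key):
        return self._strings[key]


class StringColumn:
    """Stores small per-column keys; maps them to global keys and back."""

    def __init__(self, lexicon):
        self.lexicon = lexicon
        self._local = {}    # global key -> local key
        self._globals = []  # local key -> global key
        self.rows = []      # one local key per row

    def append(self, s):
        g = self.lexicon.encode(s)
        if g not in self._local:
            self._local[g] = len(self._globals)
            self._globals.append(g)
        self.rows.append(self._local[g])

    def get(self, row):
        return self.lexicon.decode(self._globals[self.rows[row]])
```

Because the local keys are small integers, they can be stored with the same kind of integer compression used for numeric columns.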
  • 15. The Cube – Filtering
      Filter
        – Keep a bit-filter over a particular range of rows as a state.
      Filtering
        – By number or range – pass through a column and update the filter. Use binary search for ordered columns such as time, inverted index for user id.
        – By key – map the key to a number and filter by the number.
        – By set of keys – map the keys to a bit-set and filter using the bit-set.
        – By pattern – filter by the set of keys matching the pattern.
      Logical operations
        – AND, OR, NOT – use unary negation, binary intersection/join and a stack of filters.
      Advanced operations
        – Use aggregation output as filtering input (e.g., top-list, explosion, histogram, etc.).
        – Join between different cubes on one or multiple dimensions.
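A bit-filter over rows can be illustrated with a plain Python integer used as a bit-set, where bit i is set when row i passes. This is a simplified sketch under that assumption; the function names are hypothetical, and a real implementation would scan compressed column chunks rather than Python lists.

```python
from bisect import bisect_left, bisect_right


def filter_sorted_range(column, lo, hi):
    """Rows with lo <= value <= hi on an ordered column, found by binary search."""
    start, end = bisect_left(column, lo), bisect_right(column, hi)
    return ((1 << (end - start)) - 1) << start


def filter_equals(column, value):
    """Rows whose value equals `value`, found by a pass through the column."""
    bits = 0
    for row, v in enumerate(column):
        if v == value:
            bits |= 1 << row
    return bits


def filter_in(column, values):
    """Rows whose value is in a set of keys."""
    wanted = set(values)
    bits = 0
    for row, v in enumerate(column):
        if v in wanted:
            bits |= 1 << row
    return bits


def filter_not(bits, n_rows):
    """Negate a filter, restricted to the first n_rows rows."""
    return ~bits & ((1 << n_rows) - 1)
```

AND and OR are then just `a & b` and `a | b` on two filters, which is what makes combining arbitrary conditions cheap.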
  • 16. The Cube – Aggregation
      Operations
        – Count – count the number of bits in the filter.
        – Sum – sum the numbers where the filter bit is set.
        – Cardinality – count the number of distinct keys/numbers.
        – CardinalityEstimator – create a HyperLogLog cardinality estimator.
        – Frequency – create a map of keys/numbers with the associated count.
        – TopList – create a frequency map with only the k most popular keys/numbers.
        – SumBy – create a map of keys/numbers with the associated sum.
        – CardinalityMap – create a map of keys/numbers with the associated cardinality.
        – FrequencyDistribution – create a histogram over frequencies.
        – CardinalityDistribution – create a histogram over cardinalities.
        – SumByDistribution – create a histogram over sums.
        – NumericalStatistics – compute distribution statistics for numbers (min, max, percentiles).
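A few of these aggregations can be sketched over the same kind of bit-filter, again modelled as a Python integer where bit i selects row i. The function names and the plain-list columns are assumptions for illustration; the exact (non-estimated) variants are shown, so HyperLogLog is not covered here.

```python
from collections import Counter


def selected_rows(bits):
    """Yield the row numbers whose bit is set in the filter."""
    row = 0
    while bits:
        if bits & 1:
            yield row
        bits >>= 1
        row += 1


def count(bits):
    """Count: number of bits set in the filter."""
    return bin(bits).count("1")


def total(bits, column):
    """Sum: sum the numbers where the filter bit is set."""
    return sum(column[r] for r in selected_rows(bits))


def cardinality(bits, column):
    """Cardinality: number of distinct values among the selected rows."""
    return len({column[r] for r in selected_rows(bits)})


def frequency(bits, column):
    """Frequency: value -> number of selected rows with that value."""
    return Counter(column[r] for r in selected_rows(bits))


def top_list(bits, column, k):
    """TopList: the k most frequent values with their counts."""
    return frequency(bits, column).most_common(k)


def sum_by(bits, keys, values):
    """SumBy: key -> sum of `values` over the selected rows with that key."""
    sums = Counter()
    for r in selected_rows(bits):
        sums[keys[r]] += values[r]
    return dict(sums)
```

Feeding the keys returned by `top_list` back into a set-of-keys filter is one way to read the "aggregation output as filtering input" idea from the filtering slide.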
  • 17. The Cube – Updates
      Partitioning
        – Most of the data structures are partitioned into chunks of data in order to improve memory allocation, materialization, skipping, compression and locking.
      Static and dynamic parts
        – Each data column, lexicon or mapping consists of a static and a dynamic part.
        – The static part is ordered – can use binary search and Minimal Perfect Hashing.
        – The dynamic part is read-write – it has to be searched exhaustively, but this is improved using Wavelet Trees.
      Locking
        – Distinct Read and Read-Write Locks with different granularity/scope.
        – The updates are mostly appends, but some of the columns might be updated later (e.g., active time, exit query, etc.).
      Maintenance
        – Periodically flush the dynamic part into the static part.
        – Remove the old data, delete unused strings, optimize the mapping.
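The static/dynamic split can be illustrated with a small key set: lookups use binary search on the ordered static part and fall back to an exhaustive scan of the dynamic part, and a periodic flush merges the two. This is a deliberately simplified sketch; the class name `TwoPartKeySet` is hypothetical, and Minimal Perfect Hashing, Wavelet Trees and locking are left out.

```python
from bisect import bisect_left


class TwoPartKeySet:
    def __init__(self):
        self.static = []   # sorted; only rebuilt on flush
        self.dynamic = []  # recent appends, unsorted

    def contains(self, key):
        # Static part: binary search on the ordered list.
        i = bisect_left(self.static, key)
        if i < len(self.static) and self.static[i] == key:
            return True
        # Dynamic part: exhaustive scan.
        return key in self.dynamic

    def add(self, key):
        # Updates only ever touch the dynamic part.
        if not self.contains(key):
            self.dynamic.append(key)

    def flush(self):
        # Maintenance: merge the dynamic part into the static part.
        self.static = sorted(self.static + self.dynamic)
        self.dynamic = []
```

Keeping the dynamic part small through regular flushes is what keeps the exhaustive scan cheap in this model.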
  • 18. The Cube – Advanced Data Types
      Keyword vectors
        – Represent user and document profiles.
        – Each contains a document id, a version and a set of group-item pairs with weights.
        – Stored in a separate, highly partitioned set of containers.
        – Each container keeps multiple groups.
        – Each group contains document ids, items and weights as columns.
  • 19. The Cube – Advanced Data Types
      Structured data
        – Can represent any simple JSON object (document).
        – Node types: Null, Object, Array, Integer, Float, String, Boolean.
        – Stored in a separate container, separate columns for each node type.
        – Each document is decomposed into a list of paths and nodes.
        – Each node is added to the corresponding column.
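Decomposing a JSON document into paths and typed nodes could look like the sketch below. The node type names come from the slide; the tuple-based path representation, the `(doc_id, path, value)` row layout and the function names `decompose` and `store` are assumptions for illustration.

```python
def decompose(node, path=()):
    """Yield (path, node_type, value) for every node in a parsed JSON document."""
    if isinstance(node, dict):
        yield path, "Object", None
        for key, child in node.items():
            yield from decompose(child, path + (key,))
    elif isinstance(node, list):
        yield path, "Array", None
        for index, child in enumerate(node):
            yield from decompose(child, path + (index,))
    elif node is None:
        yield path, "Null", None
    elif isinstance(node, bool):  # must be checked before int: bool is an int subclass
        yield path, "Boolean", node
    elif isinstance(node, int):
        yield path, "Integer", node
    elif isinstance(node, float):
        yield path, "Float", node
    elif isinstance(node, str):
        yield path, "String", node
    else:
        raise TypeError(f"not a JSON value: {node!r}")


def store(doc_id, document, columns):
    """Add each node to the column for its node type."""
    for path, node_type, value in decompose(document):
        columns.setdefault(node_type, []).append((doc_id, path, value))
```

Because every row keeps its path, the original document can be rebuilt by replaying the rows for one document id in path order.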
  • 20. Analytics API and UI
      Analytics API
        – RESTful API – client-server, HTTP requests and response codes, stateless, cacheable, etc.
        – API resource paths, JSON in, JSON out.
        – Most of the APIs require authentication.
        – Simple integration via cx.py, Java/JavaScript/C#/Python/Perl/PHP or HTTP calls directly.
      Traffic API
        – A rich set of high-level APIs.
        – Powerful ad-hoc syntax – types, groups, items, filters, fields, etc.
        – See the demo!
      Analytics UI
        – HTML and JavaScript.
        – Built on top of the Analytics API.
        – Has multiple fixed, functional views which can be combined with arbitrary filters.
        – Premium users have a workspace area for dynamic, configurable widgets.
  • 22. Thank you! Questions? Credits: Erik Gorset & Oslo Dev Team
  • 23. …btw, we are hiring!
      Connect with Cxense
        – www.cxense.com
        – https://twitter.com/cxense
        – www.facebook.com/cxense
        – www.linkedin.com/company/cxense
        – simon.jonassen@cxense.com