Large-Scale Real-Time Data Management
for Engagement and Monetization
Simon Lia-Jonassen
LSDS-IR 2015
Our mission is to help companies understand their
audience and build great online user experiences.
– Find interesting articles.
– Stay longer on the site.
– Get relevant ads.
– Sign up for subscriptions.
Some of our customers:
About Cxense
Our solutions
Cxense DMP
How does it work!? (JavaScript tag example)
Page view events (example)
Content profiles (example)
Custom events (example)
UI and API capabilities
User Segments
Data Volume and Traffic
(monthly)
–  5 000 active Web-sites
–  100 million pages
–  1 billion users
–  15 billion page views
Constraints and Requirements
–  Online and real-time processing
•  Show, analyze and act on what is happening
exactly right now.
–  High and sustainable performance
•  Peak load of 10K+ requests/sec.
•  50 ms latency constraint for ads and recs.
–  Availability, reliability, durability
•  multi DC and fault-tolerance
–  Security and privacy
Challenges
Heterogeneity and Reliability
–  Hundreds of mobile and desktop platforms, browsers, internet providers, etc.
–  Multiple browsers and devices per user, cross-domain tracking (3rd party cookies are dying out).
–  Web-pages (articles, image/video galleries, chats, search/front pages) and human language.
–  The Internet is Broken™
Customer success
–  Providing the right insights.
•  Data, metrics and visualization.
–  Providing the right set of tools.
•  Usability, brevity, expressiveness,
completeness.
–  Best practices.
•  Analytics, ads, recs, user engagement,
personalization and subscription optimization.
–  Onboarding and support.
Challenges
Communication
–  HTTP with JSON payload.
–  Durable and Idempotent.
Local storage
–  Atomically append to file.
–  Use a separate directory for each
partition and a new file each hour.
–  Tail files and/or directories.
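The local storage scheme above can be sketched as follows. This is a minimal illustration, not the actual implementation: the helper names `hourly_path` and `append_record`, the `<root>/<partition>/<YYYYMMDDHH>.log` layout and the use of UTC hours are assumptions. It relies on `O_APPEND` and one `os.write` call per record so that each record lands at the current end of the file.

```python
import os
from datetime import datetime, timezone


def hourly_path(root: str, partition: int, ts_ms: int) -> str:
    """Hypothetical layout: one directory per partition, one file per UTC hour."""
    hour = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).strftime("%Y%m%d%H")
    return os.path.join(root, str(partition), hour + ".log")


def append_record(root: str, partition: int, ts_ms: int, line: str) -> str:
    """Append one newline-terminated record and return the file it went to."""
    path = hourly_path(root, partition, ts_ms)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    data = (line.rstrip("\n") + "\n").encode("utf-8")
    # O_APPEND positions every write at the end of the file; the record is
    # handed to the OS in a single write call rather than piece by piece.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
    return path
```

Because a new file starts every hour, a reader tailing the partition directory only needs to follow the newest file and can treat older ones as closed.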
Metadata
–  Keeps the state.
–  Rewind and re-feed when needed.
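One way to picture the metadata role is a tailer that stores its read offset next to the log and can be rewound. The sketch below is an assumption-laden illustration: the `Tailer` class, the JSON state file and the byte-offset representation are hypothetical choices, not details given in the slides.

```python
import json
import os
from typing import List


class Tailer:
    """Reads complete lines from a log file, remembering the offset on disk."""

    def __init__(self, log_path: str, state_path: str):
        self.log_path = log_path
        self.state_path = state_path

    def offset(self) -> int:
        try:
            with open(self.state_path) as f:
                return int(json.load(f)["offset"])
        except FileNotFoundError:
            return 0  # no state yet: start from the beginning

    def _save(self, offset: int) -> None:
        # Write to a temp file and rename, so the state is never half-written.
        tmp = self.state_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"offset": offset}, f)
        os.replace(tmp, self.state_path)

    def poll(self) -> List[str]:
        """Return the complete lines added since the stored offset."""
        start = self.offset()
        with open(self.log_path, "rb") as f:
            f.seek(start)
            data = f.read()
        end = data.rfind(b"\n") + 1  # ignore a trailing partial line
        lines = [raw.decode("utf-8") for raw in data[:end].split(b"\n")[:-1]]
        if end:
            self._save(start + end)
        return lines

    def rewind(self, offset: int = 0) -> None:
        """Move the stored offset back so that older lines are re-fed."""
        self._save(offset)
```

Re-feeding after a rewind delivers the same records again, which is safe only because the receiving side is idempotent, as stated under Communication.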
System
–  Configured via Upstart and Cron.
–  Monitoring via Graphite and log files.
–  Automatic alerting.
Architecture and Data Flow
Data Cubes
–  Partitioned column store database.
–  Efficient string handling and integer compression.
–  Fast filtering and aggregation over billions of data points.
–  Low update latency (100ms).
–  Exists in multiple variants:
•  Disk or memory based.
•  Partitioned by site, by user or by both.
–  Low-level API.
Example:
The Cube
time           user    rnd     siteid  url                                browser
1409425329634  "4szi"  "xzst"  "9978"  "cxnews.com"                       "Chrome"
1409425329634  "zthp"  "fd0z"  "9978"  "cxnews.com/seahawks-win-again…"   "Firefox"
1409425329635  "4szi"  "tzdt"  "9978"  "cxnews.com/tesla-model-3-will-…"  "Chrome"
1409425329640  "4szi"  "aext"  "9978"  "cxnews.com/elon-musk-is-awes…"    "Chrome"
1409425329640  "zx5t"  "dxrf"  "9978"  "cxnews.com/tesla-model-3-will-…"  "Safari"
Frame of Reference Compression
–  Compress the numbers in groups of 64.
–  If the sequence is increasing – use the first number as the reference and compute the
differences between consecutive numbers (deltas).
–  Find the maximum number of bits (width) needed to represent the largest delta and
compress the deltas using a fixed bit width.
–  For non-increasing sequences, use the smallest number as the reference and the
differences between the numbers and the reference as deltas.
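The steps above can be sketched in a few lines. This is an illustration under assumptions: the function names, the tuple layout of a block and the use of a Python integer as the packed bit buffer are choices made for the example, and "increasing" is taken to mean non-decreasing.

```python
def encode_block(values):
    """Frame-of-reference encode up to 64 integers into one block."""
    if not 0 < len(values) <= 64:
        raise ValueError("a block holds 1 to 64 values")
    increasing = all(a <= b for a, b in zip(values, values[1:]))
    if increasing:
        ref = values[0]  # first number is the reference
        deltas = [b - a for a, b in zip(values, values[1:])]
    else:
        ref = min(values)  # smallest number is the reference
        deltas = [v - ref for v in values]
    # Width: bits needed for the largest delta (0 if all deltas are zero).
    width = max((d.bit_length() for d in deltas), default=0)
    packed = 0
    for i, d in enumerate(deltas):
        packed |= d << (i * width)  # fixed-width slot number i
    return increasing, ref, width, len(values), packed


def decode_block(block):
    increasing, ref, width, n, packed = block
    mask = (1 << width) - 1
    count = n - 1 if increasing else n
    deltas = [(packed >> (i * width)) & mask for i in range(count)]
    if increasing:
        out = [ref]
        for d in deltas:
            out.append(out[-1] + d)  # running sum restores the sequence
        return out
    return [ref + d for d in deltas]


def compress(values, block_size=64):
    return [encode_block(values[i:i + block_size])
            for i in range(0, len(values), block_size)]


def decompress(blocks):
    return [v for block in blocks for v in decode_block(block)]
```

For the five timestamps in the example cube the deltas are 0, 1, 5, 0, so each delta fits in 3 bits instead of the 41 bits the raw values need.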
The Cube – Integer Columns
–  A global lexicon maps all strings to numbers and back.
–  For each column, map global keys to a smaller set of numbers and back.
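A minimal sketch of the two-level mapping, assuming plain dictionaries and lists stand in for the real structures. The class names `Lexicon` and `StringColumn` are hypothetical.

```python
class Lexicon:
    """Global mapping: string <-> global integer key."""

    def __init__(self):
        self._ids = {}
        self._strings = []

    def encode(self, s: str) -> int:
        if s not in self._ids:
            self._ids[s] = len(self._strings)
            self._strings.append(s)
        return self._ids[s]

    def decode(self, key: int) -> str:
        return self._strings[key]


class StringColumn:
    """Per-column mapping: global key <-> small local key, plus the row data."""

    def __init__(self, lexicon: Lexicon):
        self.lexicon = lexicon
        self._local = {}   # global key -> local key
        self._global = []  # local key -> global key
        self.rows = []     # one local key per row

    def append(self, s: str) -> None:
        g = self.lexicon.encode(s)
        if g not in self._local:
            self._local[g] = len(self._global)
            self._global.append(g)
        self.rows.append(self._local[g])

    def get(self, row: int) -> str:
        return self.lexicon.decode(self._global[self.rows[row]])
```

The local keys stay small even when the global lexicon is large, so the row data of a low-cardinality column such as browser compresses to very few bits per value.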
The Cube – String Columns
Structured data
–  Can represent any simple JSON object (document).
–  Node types: Null, Object, Array, Integer, Float, String, Boolean.
–  Stored in a separate container, separate columns for each node type.
–  Each document is decomposed into a list of paths and nodes.
–  Each node is added to the corresponding column.
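The decomposition into paths and nodes could look roughly like this. It is a sketch under assumptions: the tuple-based path representation, the `(doc_id, path, value)` column entries and the function names are hypothetical, and the value stored for Object and Array nodes (their size) is a choice made for the example.

```python
from collections import defaultdict


def decompose(node, path=()):
    """Yield (path, node_type, value) for every node of a parsed JSON document."""
    if node is None:
        yield path, "Null", None
    elif isinstance(node, bool):  # check bool before int: bool is an int subtype
        yield path, "Boolean", node
    elif isinstance(node, int):
        yield path, "Integer", node
    elif isinstance(node, float):
        yield path, "Float", node
    elif isinstance(node, str):
        yield path, "String", node
    elif isinstance(node, list):
        yield path, "Array", len(node)
        for i, child in enumerate(node):
            yield from decompose(child, path + (i,))
    elif isinstance(node, dict):
        yield path, "Object", len(node)
        for key, child in node.items():
            yield from decompose(child, path + (key,))
    else:
        raise TypeError(f"not a JSON value: {type(node).__name__}")


def add_document(columns, doc_id, doc):
    """Append each node to the column that matches its node type."""
    for path, node_type, value in decompose(doc):
        columns[node_type].append((doc_id, path, value))


columns = defaultdict(list)
add_document(columns, 0, {"user": "4szi", "tags": ["a", 1], "paid": False})
```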
The Cube – Advanced Data Types
Filtering operations and tricks:
–  Keep a bit-filter over a range of rows (1 = exclude).
–  By a number or range – unset bits where the numbers do not match. Can use binary search for ordered
columns such as time, and inverted indexes for unordered columns such as user id.
–  By a key or set of keys – map keys to a number or bit-set and filter.
–  By pattern – filter by the set of keys matching the pattern.
–  Logical AND, OR, NOT – use a stack of filters and binary operations.
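The filtering tricks above can be illustrated with Python integers as bit-filters. Assumptions of this sketch: a set bit means the row is kept (the opposite polarity works the same way with the operations swapped), the helper names are hypothetical, and the boolean expression is given in postfix order so that a plain stack can evaluate it.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from fnmatch import fnmatchcase


def range_filter(sorted_col, lo, hi):
    """Rows with lo <= value <= hi in an ordered column, via binary search."""
    start, end = bisect_left(sorted_col, lo), bisect_right(sorted_col, hi)
    return ((1 << (end - start)) - 1) << start


def build_inverted_index(col):
    """Map each key of an unordered column to the bit-set of its rows."""
    index = defaultdict(int)
    for row, key in enumerate(col):
        index[key] |= 1 << row
    return index


def key_filter(index, keys):
    mask = 0
    for key in keys:
        mask |= index.get(key, 0)
    return mask


def pattern_filter(index, pattern):
    """Filter by the set of keys that match a glob-style pattern."""
    return key_filter(index, [k for k in index if fnmatchcase(k, pattern)])


def evaluate(program, n_rows):
    """Evaluate a postfix list of masks and 'AND' / 'OR' / 'NOT' on a stack."""
    all_rows = (1 << n_rows) - 1
    stack = []
    for item in program:
        if item == "AND":
            b, a = stack.pop(), stack.pop()
            stack.append(a & b)
        elif item == "OR":
            b, a = stack.pop(), stack.pop()
            stack.append(a | b)
        elif item == "NOT":
            stack.append(all_rows & ~stack.pop())
        else:
            stack.append(item)
    (result,) = stack
    return result


def rows(mask):
    """Row numbers whose bit is set."""
    return [i for i in range(mask.bit_length()) if mask >> i & 1]
```

With the example cube, `evaluate([range_filter(time, t1, t2), key_filter(user_index, ["4szi"]), "AND"], 5)` keeps the rows of user "4szi" inside the time range.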
The Cube – Filtering and Aggregation
Some aggregation operations and tricks:
–  Count, Sum, Cardinality – bit-counting. Can use HLL for distributed cardinality.
–  Frequency, SumBy, CardinalityMap – sorting and bit-counting using pairs of integers.
–  FrequencyDistribution, SumByDistribution, CardinalityDistribution – histograms, more sorting and bit-counting.
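A few of these aggregations, sketched over integer-coded columns and an integer bit-filter where a set bit means the row is kept. The function names mirror the operation names on the slide, but their signatures are assumptions. Cardinality is computed exactly here, and HLL is not implemented.

```python
from itertools import groupby


def selected(mask, col):
    """Values of the rows whose bit is set in the filter."""
    return [v for i, v in enumerate(col) if mask >> i & 1]


def count(mask):
    return bin(mask).count("1")  # bit-counting


def total(mask, col):
    return sum(selected(mask, col))


def cardinality(mask, col):
    return len(set(selected(mask, col)))  # exact; HLL would approximate this


def frequency(mask, col):
    """(key, number of rows) per key, found by sorting and counting runs."""
    keys = sorted(selected(mask, col))
    return [(k, len(list(run))) for k, run in groupby(keys)]


def cardinality_map(mask, group_col, value_col):
    """(group key, number of distinct values) using sorted pairs of integers."""
    pairs = sorted(set(zip(selected(mask, group_col), selected(mask, value_col))))
    return [(g, len(list(run))) for g, run in groupby(pairs, key=lambda p: p[0])]
```

For example, `cardinality_map(mask, url, user)` gives the number of distinct users per URL among the rows that pass the filter.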
The Cube – Filtering and Aggregation
Advanced operations
–  Use aggregation output as filtering input (e.g., top-list, histogram, etc.).
–  Join between cubes on one or multiple dimensions.
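Both advanced operations can be illustrated on plain Python lists. The sketch assumes a top-list as the aggregation that feeds the filter and a simple hash join on a single dimension. The names `top_keys`, `filter_by_keys` and `join` are hypothetical.

```python
from collections import Counter, defaultdict


def top_keys(col, k):
    """Aggregation step: the k most frequent keys of a column."""
    return {key for key, _ in Counter(col).most_common(k)}


def filter_by_keys(col, keys):
    """Filtering step: rows whose key is in the aggregation output."""
    return [row for row, key in enumerate(col) if key in keys]


def join(left_col, right_col):
    """Hash join: (left_row, right_row) pairs with equal values on one dimension."""
    index = defaultdict(list)
    for r, key in enumerate(right_col):
        index[key].append(r)
    return [(l, r) for l, key in enumerate(left_col) for r in index.get(key, [])]


# Rows of the most active user, then a join against a second cube's user column.
events_user = ["4szi", "zthp", "4szi", "4szi", "zx5t"]
profile_user = ["zthp", "4szi"]
top_rows = filter_by_keys(events_user, top_keys(events_user, 1))
pairs = join(events_user, profile_user)
```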
The Cube – Filtering and Aggregation
Partitioning
–  Most of the data structures are partitioned into chunks of data.
–  This improves memory allocation, materialization, skipping, compression and locking.
Static and dynamic parts
–  Each data column, lexicon or mapping consists of a static and a dynamic part.
–  The static part is ordered – use binary search and Minimal Perfect Hashing.
–  The dynamic part is read-write – it has to be searched exhaustively, but this is improved using Wavelet Trees.
–  Updates are mostly appends, but an update can also be done as a deletion followed by a new write.
Maintenance
–  Periodically flush the dynamic part into the static part.
–  Remove the old data, delete unused strings, optimize the mapping.
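The static/dynamic split and the periodic flush can be sketched with a small key set. This is a deliberate simplification: the `KeySet` class is hypothetical, the static part uses only binary search (no Minimal Perfect Hashing), and the dynamic part is scanned linearly (no Wavelet Tree).

```python
from bisect import bisect_left


class KeySet:
    """Set of integer keys with an ordered static part and an appended dynamic part."""

    def __init__(self):
        self.static = []   # sorted, read-only between flushes
        self.dynamic = []  # unsorted, append-only

    def __contains__(self, key):
        # Static part: ordered, so binary search is enough.
        i = bisect_left(self.static, key)
        if i < len(self.static) and self.static[i] == key:
            return True
        # Dynamic part: has to be searched exhaustively.
        return key in self.dynamic

    def add(self, key):
        if key not in self:
            self.dynamic.append(key)  # updates are appends

    def flush(self):
        """Maintenance: merge the dynamic part into a new sorted static part."""
        self.static = sorted(set(self.static) | set(self.dynamic))
        self.dynamic = []
```

Keeping the dynamic part small through regular flushes bounds the cost of the exhaustive scan, which is the trade-off the maintenance step addresses.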
The Cube – Updates
Thank you!
Questions?
Credits: Erik Gorset and the Oslo R&D Team
simon.jonassen@cxense.com
…btw, we are hiring!
cxense.com
facebook.com/cxense
twitter.com/cxense
linkedin.com/company/cxense
youtube.com/user/cxense
One more thing… the Internet of Things!

tolgahangng
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 

Recently uploaded (20)

GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 

Large-Scale Real-Time Data Management for Engagement and Monetization

  • 1. Large-Scale Real-Time Data Management for Engagement and Monetization Simon Lia-Jonassen LSDS-IR 2015
  • 2. About Cxense
    Our mission is to help companies understand their audience and build great online user experiences.
    –  Find interesting articles.
    –  Stay longer on the site.
    –  Get relevant ads.
    –  Sign up for subscriptions.
    Some of our customers:
  • 5. How does it work!? (JavaScript tag example)
  • 6. Page view events (example)
  • 9.
  • 10. UI and API capabilities
  • 11. UI and API capabilities
  • 13. Challenges
    Data Volume and Traffic (monthly)
    –  5 000 active websites
    –  100 million pages
    –  1 billion users
    –  15 billion page views
    Constraints and Requirements
    –  Online and real-time processing
       •  Show, analyze and act on what is happening right now.
    –  High and sustainable performance
       •  Peak load of 10K+ requests/sec.
       •  50 ms latency constraint for ads and recs.
    –  Availability, reliability, durability
       •  Multi-DC and fault tolerance
    –  Security and privacy
  • 14. Challenges
    Heterogeneity and Reliability
    –  Hundreds of mobile and desktop platforms, browsers, internet providers, etc.
    –  Multiple browsers and devices per user, cross-domain tracking (3rd party cookies are dying out).
    –  Web pages (articles, image/video galleries, chats, search/front pages) and human language.
    –  The Internet is Broken™
    Customer success
    –  Providing the right insights.
       •  Data, metrics and visualization.
    –  Providing the right set of tools.
       •  Usability, brevity, expressiveness, completeness.
    –  Best practices.
       •  Analytics, ads, recs, user engagement, personalization and subscription optimization.
    –  Onboarding and support.
  • 15. Architecture and Data Flow
    Communication
    –  HTTP with JSON payload.
    –  Durable and idempotent.
    Local storage
    –  Atomically append to file.
    –  Use a separate directory for each partition and a new file each hour.
    –  Tail files and/or directories.
    Metadata
    –  Keeps the state.
    –  Rewind and re-feed when needed.
    System
    –  Configured via Upstart and Cron.
    –  Monitoring via Graphite and log files.
    –  Automatic alerting.
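The local storage scheme on this slide (one directory per partition, a new file each hour, append-only writes) can be sketched roughly as follows. This is a minimal illustration, not the Cxense implementation: the directory and file naming, the `append_event` helper and the use of `O_APPEND` with a single `write` call are all assumptions made for the example.

```python
import json
import os
import time


def event_path(root, partition, ts):
    """Hypothetical layout: <root>/partition-NNNN/<YYYYMMDD-HH>.log (UTC hour)."""
    hour = time.strftime("%Y%m%d-%H", time.gmtime(ts))
    return os.path.join(root, f"partition-{partition:04d}", f"{hour}.log")


def append_event(root, partition, event, ts):
    """Append one JSON event as a single line and flush it to disk."""
    path = event_path(root, partition, ts)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    line = (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")
    # O_APPEND makes the OS position every write at the end of the file;
    # issuing the whole line in one write() call keeps lines from interleaving.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, line)
        os.fsync(fd)  # durability: force the data to stable storage
    finally:
        os.close(fd)
    return path
```

A consumer can then tail the newest file in each partition directory and remember its byte offset as metadata, which is one way to support the "rewind and re-feed" step.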
  • 16. The Cube
    Data Cubes
    –  Partitioned column store database.
    –  Efficient string handling and integer compression.
    –  Fast filtering and aggregation over billions of data points.
    –  Low update latency (100 ms).
    –  Exists in multiple variants:
       •  Disk or memory based.
       •  Partitioned by site, by user or by both.
    –  Low-level API.
    Example:
    time           user    rnd     siteid  url                               browser
    1409425329634  “4szi”  “xzst”  “9978”  “cxnews.com”                      “Chrome”
    1409425329634  “zthp”  “fd0z”  “9978”  “cxnews.com/seahawks-win-again…”  “Firefox”
    1409425329635  “4szi”  “tzdt”  “9978”  “cxnews.com/tesla-model-3-will-…” “Chrome”
    1409425329640  “4szi”  “aext”  “9978”  “cxnews.com/elon-musk-is-awes…”   “Chrome”
    1409425329640  “zx5t”  “dxrf”  “9978”  “cxnews.com/tesla-model-3-will-…” “Safari”
  • 17. The Cube – Integer Columns
    Frame of Reference Compression
    –  Compress the numbers in groups of 64.
    –  If the sequence is increasing, use the first number as the reference and compute the differences between each pair of consecutive numbers (deltas).
    –  Find the maximum number of bits (width) needed to represent the largest delta and compress the deltas using a fixed bit width.
    –  For non-increasing sequences, use the smallest number as the reference and the differences between the numbers and the reference as deltas.
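The frame-of-reference steps on this slide can be sketched as below. This is an illustrative version under stated assumptions: "increasing" is treated as non-decreasing, the fixed-width deltas are packed into a single Python integer rather than a byte buffer, and the function names (`for_encode`, `for_decode`, `for_compress`) are invented for the example.

```python
def for_encode(block):
    """Encode up to 64 non-negative integers with frame-of-reference deltas."""
    assert 0 < len(block) <= 64
    increasing = all(b >= a for a, b in zip(block, block[1:]))
    if increasing:
        ref = block[0]  # first number is the reference
        deltas = [b - a for a, b in zip(block, block[1:])]
    else:
        ref = min(block)  # smallest number is the reference
        deltas = [v - ref for v in block]
    # Width needed for the largest delta (0 if all deltas are zero).
    width = max((d.bit_length() for d in deltas), default=0)
    packed = 0
    for i, d in enumerate(deltas):
        packed |= d << (i * width)  # fixed-width bit packing
    return increasing, ref, width, len(block), packed


def for_decode(encoded):
    """Invert for_encode and return the original list of integers."""
    increasing, ref, width, n, packed = encoded
    mask = (1 << width) - 1
    count = n - 1 if increasing else n
    deltas = [(packed >> (i * width)) & mask for i in range(count)]
    if increasing:
        out = [ref]
        for d in deltas:
            out.append(out[-1] + d)  # running sum of consecutive deltas
        return out
    return [ref + d for d in deltas]


def for_compress(values, group=64):
    """Split a column into groups of 64 and encode each group separately."""
    return [for_encode(values[i:i + group]) for i in range(0, len(values), group)]
```

For the five timestamps in the example table, the deltas are 0, 1, 5 and 0, so each one fits in 3 bits instead of the 41 bits needed for a raw millisecond timestamp.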
  • 18. The Cube – String Columns
    –  A global lexicon maps all strings to numbers and back.
    –  For each column, map global keys to a smaller set of numbers and back.
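The two-level string mapping on this slide might look like the sketch below. The class names `Lexicon` and `StringColumn` and the in-memory dict/list representation are assumptions for illustration; the slides do not describe the actual data structures at this level.

```python
class Lexicon:
    """Global mapping: string <-> global key (shared by all columns)."""

    def __init__(self):
        self._keys = {}
        self._strings = []

    def key(self, s):
        if s not in self._keys:
            self._keys[s] = len(self._strings)
            self._strings.append(s)
        return self._keys[s]

    def string(self, key):
        return self._strings[key]


class StringColumn:
    """Per-column mapping: global key <-> small local code, plus the row data."""

    def __init__(self, lexicon):
        self.lexicon = lexicon
        self._local = {}    # global key -> local code
        self._global = []   # local code -> global key
        self.rows = []      # one small local code per row

    def append(self, s):
        gkey = self.lexicon.key(s)
        if gkey not in self._local:
            self._local[gkey] = len(self._global)
            self._global.append(gkey)
        self.rows.append(self._local[gkey])

    def get(self, row):
        return self.lexicon.string(self._global[self.rows[row]])
```

Because a column such as `browser` has only a handful of distinct values, its local codes stay small even when the global lexicon holds many strings, which is what makes the integer compression of the row data effective.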
  • 19. The Cube – Advanced Data Types
    Structured data
    –  Can represent any simple JSON object (document).
    –  Node types: Null, Object, Array, Integer, Float, String, Boolean.
    –  Stored in a separate container, with separate columns for each node type.
    –  Each document is decomposed into a list of paths and nodes.
    –  Each node is added to the corresponding column.
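One way to picture the decomposition into paths and nodes is the sketch below. The tuple-based path representation, the `decompose` and `store` helpers and the choice to record the child count for Object and Array nodes are assumptions of this example, not details given in the slides.

```python
from collections import defaultdict


def decompose(doc, path=()):
    """Yield (path, node_type, value) for every node of a JSON-like document."""
    if doc is None:
        yield path, "Null", None
    elif isinstance(doc, bool):  # must be checked before int: bool is an int subclass
        yield path, "Boolean", doc
    elif isinstance(doc, int):
        yield path, "Integer", doc
    elif isinstance(doc, float):
        yield path, "Float", doc
    elif isinstance(doc, str):
        yield path, "String", doc
    elif isinstance(doc, list):
        yield path, "Array", len(doc)
        for i, item in enumerate(doc):
            yield from decompose(item, path + (i,))
    elif isinstance(doc, dict):
        yield path, "Object", len(doc)
        for key, value in doc.items():
            yield from decompose(value, path + (key,))
    else:
        raise TypeError(f"not a JSON value: {doc!r}")


def store(columns, doc_id, doc):
    """Add each node to the column for its node type."""
    for path, node_type, value in decompose(doc):
        columns[node_type].append((doc_id, path, value))


columns = defaultdict(list)
store(columns, 0, {"url": "cxnews.com", "tags": ["tesla", 3], "paid": False})
```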
  • 20. The Cube – Filtering and Aggregation
    Filtering operations and tricks:
    –  Keep a bit-filter over a range of rows (1 = exclude).
    –  By a number or range – unset bits where the numbers do not match. Can use binary search for ordered columns such as time, and inverted indexes for unordered columns such as user id.
    –  By a key or set of keys – map keys to a number or bit-set and filter.
    –  By pattern – filter by the set of keys matching the pattern.
    –  Logical AND, OR, NOT – use a stack of filters and binary operations.
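The filtering tricks above can be illustrated with Python integers used as bit sets. This is a sketch under assumptions: a set bit here means the row is still a candidate (the polarity is a choice of this example), and the helper names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def range_filter_sorted(column, lo, hi):
    """Rows with lo <= value <= hi in an ordered column, found by binary search."""
    start = bisect_left(column, lo)
    end = bisect_right(column, hi)
    return ((1 << (end - start)) - 1) << start  # contiguous run of set bits


def build_inverted_index(column):
    """Map each value of an unordered column to the rows that contain it."""
    index = {}
    for row, value in enumerate(column):
        index.setdefault(value, []).append(row)
    return index


def key_filter(index, keys):
    """Rows whose value is one of the given keys, via the inverted index."""
    mask = 0
    for key in keys:
        for row in index.get(key, ()):
            mask |= 1 << row
    return mask


def rows(mask):
    """List the row numbers whose bit is set."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


# Columns taken from the example table on the earlier slide.
times = [1409425329634, 1409425329634, 1409425329635, 1409425329640, 1409425329640]
users = ["4szi", "zthp", "4szi", "4szi", "zx5t"]
all_rows = (1 << len(times)) - 1

by_time = range_filter_sorted(times, 1409425329634, 1409425329635)
by_user = key_filter(build_inverted_index(users), {"4szi"})

both = by_time & by_user         # logical AND
either = by_time | by_user       # logical OR
not_user = ~by_user & all_rows   # logical NOT, clipped to existing rows
```

A pattern filter fits the same shape: first find the set of keys that match the pattern, then pass that set to `key_filter`.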
  • 21. The Cube – Filtering and Aggregation
    Some aggregation operations and tricks:
    –  Count, Sum, Cardinality – bit-counting. Can use HLL for distributed cardinality.
    –  Frequency, SumBy, CardinalityMap – sorting and bit-counting using pairs of integers.
    –  Frequency-, SumBy- and CardinalityDistribution – histograms, more sorting and bit-counting.
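A few of these aggregations can be sketched over a filter mask and an integer-coded column. The function names and the use of a Python integer as the bit set are assumptions of this example; HyperLogLog and the distribution variants are not shown.

```python
from itertools import groupby


def popcount(bits):
    """Number of set bits."""
    return bin(bits).count("1")


def selected(mask, column):
    """Values of the rows whose bit is set in the filter mask."""
    return [v for i, v in enumerate(column) if (mask >> i) & 1]


def count(mask):
    """Count: number of rows that passed the filter."""
    return popcount(mask)


def total(mask, column):
    """Sum of a numeric column over the selected rows."""
    return sum(selected(mask, column))


def cardinality(mask, codes):
    """Distinct keys: set one bit per key code, then count the bits."""
    seen = 0
    for code in selected(mask, codes):
        seen |= 1 << code
    return popcount(seen)


def frequency(mask, codes):
    """Frequency: sort the selected key codes and count each run."""
    ordered = sorted(selected(mask, codes))
    return [(code, len(list(run))) for code, run in groupby(ordered)]
```

For example, with browser codes `[0, 1, 0, 0, 2]` and all five rows selected, `frequency` returns `[(0, 3), (1, 1), (2, 1)]`.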
  • 22. The Cube – Filtering and Aggregation
    Advanced operations
    –  Use aggregation output as filtering input (e.g., top-list, histogram, etc.).
    –  Join between cubes on one or multiple dimensions.
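Feeding an aggregation back into a filter can be illustrated as follows: compute a top-list of users, then keep only the rows that belong to those users. `top_keys` and `filter_by_keys` are hypothetical helpers, and `collections.Counter` stands in for the cube's own frequency aggregation.

```python
from collections import Counter


def top_keys(column, k):
    """Aggregation step: the k most frequent keys in a column."""
    return {key for key, _ in Counter(column).most_common(k)}


def filter_by_keys(column, keys):
    """Filtering step: bit mask of the rows whose key is in the given set."""
    mask = 0
    for row, key in enumerate(column):
        if key in keys:
            mask |= 1 << row
    return mask


users = ["4szi", "zthp", "4szi", "4szi", "zx5t"]
urls = ["/", "/seahawks", "/tesla", "/elon", "/tesla"]

heavy_users = top_keys(users, 1)
mask = filter_by_keys(users, heavy_users)
pages_of_heavy_users = [u for i, u in enumerate(urls) if (mask >> i) & 1]
```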
  • 23. The Cube – Updates
    Partitioning
    –  Most of the data structures are partitioned into chunks of data.
    –  This improves memory allocation, materialization, skipping, compression and locking.
    Static and dynamic parts
    –  Each data column, lexicon or mapping consists of a static and a dynamic part.
    –  The static part is ordered: use binary search and Minimal Perfect Hashing.
    –  The dynamic part is read-write: it has to be searched exhaustively, but this is improved using Wavelet Trees.
    –  Updates are mostly appends, but updates can also be done via deletion and a new write.
    Maintenance
    –  Periodically flush the dynamic part into the static part.
    –  Remove the old data, delete unused strings, optimize the mapping.
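The split into a static and a dynamic part can be sketched with a small key set. This is a simplified illustration: the class name `TwoPartSet` is invented, the static part uses only binary search (no Minimal Perfect Hashing), and the dynamic part is a plain list scanned linearly (no Wavelet Tree).

```python
from bisect import bisect_left


class TwoPartSet:
    """Key set with an ordered static part and an append-only dynamic part."""

    def __init__(self):
        self.static = []   # sorted, only rebuilt on flush
        self.dynamic = []  # recent appends, unordered

    def __contains__(self, key):
        i = bisect_left(self.static, key)  # binary search in the static part
        if i < len(self.static) and self.static[i] == key:
            return True
        return key in self.dynamic  # exhaustive scan of the dynamic part

    def add(self, key):
        if key not in self:
            self.dynamic.append(key)  # updates are appends

    def flush(self):
        """Maintenance: merge the dynamic part into the static part."""
        self.static = sorted(set(self.static) | set(self.dynamic))
        self.dynamic = []
```

Keeping the dynamic part small through periodic flushes is what bounds the cost of the exhaustive scan.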
  • 24. Thank you! Questions?
    Credits: Erik Gorset and the Oslo R&D Team
    simon.jonassen@cxense.com
    …btw, we are hiring!
    cxense.com
    facebook.com/cxense
    twitter.com/cxense
    linkedin.com/company/cxense
    youtube.com/user/cxense
  • 25. One more thing… the Internet of Things!