Extending Cassandra with
Doradus OLAP for
High Performance Analytics
Randy Guck
Principal Engineer,
Dell Software
30 Doradus: The Tarantula Nebula
Source: Hubble Space Telescope
Webcast sponsored by:
Agenda
• Doradus Overview
– What is it?
– Why Cassandra?
– Data model and DQL
• Architecture
– Storage services
• Doradus OLAP
– Motivation and sharding model
– Example Events application
– Example queries
Doradus In A Nutshell
• Storage and query service
• Uses NoSQL DB for persistence
• Pure Java
• Stateless
• Full text and graph query language
• Pluggable storage services
• Bundled with Dell UCCS Analytics
for 3 years
• Open source, Apache License 2.0
(Diagram: an application calls Doradus through the REST API; Doradus talks to Cassandra over Thrift or CQL; Cassandra persists the data and log files.)
Example Multi-Node Cluster
(Diagram: applications call one or more Doradus instances through the REST API; each Doradus instance connects over Thrift or CQL to a 3-node Cassandra cluster, with data on Nodes 1, 2, and 3. Secondary Doradus instances are optional.)
Cassandra: Best NoSQL DB?
• Choice for many big data apps
• Wide-column model
• CQL: Similar to SQL
• Pure peer architecture
• Cabinet- and Data Center-aware
• Elasticity, recovery features
• Consistent benchmark winner
So, Why Doradus?
• Cassandra limitations (by design):
– Minimal indexing
– No relationship support
– Limited queries
• e.g., SELECT <columns> FROM <table> WHERE KEY IN (<list>)
• No joins, embedded selects, OR clauses, ...
• No full text, range, or statistical queries
– API: requires client driver
• What if we could leverage Cassandra’s best features
while elevating the data model and query language?
Doradus Data Model:
Example Message Tracking Schema
Tables and scalar fields:
Message {Size, SendDate, ...}
Participant {ReceiptDate}
Address {Email}
Person {FirstName, LastName, Department, ...}
Attachment {Size, Extension}
Bi-directional relationships are formed by link pairs: Sender/MessageAsSender and Recipients/MessageAsRecipient (Message↔Participant), Address/Participants (Participant↔Address), Person/Address (Address↔Person), Attachments/Message (Message↔Attachment), and Manager/Employees (Person↔Person).
Doradus Query Language:
Object Queries
• Goal:
Find and return objects
• Builds on Lucene syntax:
Full text queries: terms, phrases,
ranges, AND/OR/NOT, etc.
• Adds link paths:
Directed graph searches
Quantifiers and filters
Transitive searches
• Other features:
Stateless paging
Sorting
• Example DQL expressions:
// On Person:
LastName = Smith AND
NOT (FirstName : Jo*) AND
BirthDate = [1986 TO 1992]
// On Message:
ALL(Recipients).ANY(Address.
WHERE(Email='*.gmail.com')).
Person.Department : support
// On Person:
Employees^(4).Office='San Jose'
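For reference, an object query such as the first example above can be submitted through the REST API. A minimal sketch, assuming an application named Email that hosts the Person table and an object-query resource of the form /{application}/{table}/_query (check the Doradus documentation for the exact URI, and URL-encode the query string in a real request):
GET /Email/Person/_query?q=LastName=Smith AND NOT (FirstName:Jo*) AND BirthDate=[1986 TO 1992]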
Doradus Query Language:
Aggregate Queries
• Goal:
Compute statistics
• Metric functions:
COUNT, AVERAGE, MIN,
MAX, DISTINCT, SUM, ...
• Multi-level grouping
• Grouping functions:
BATCH, BOTTOM, FIRST,
LAST, LOWER, SETS, TERMS,
TOP, TRUNCATE, UPPER,
WHERE, ...
• Example queries (on Message):
// Multi-metric function
metric=COUNT(*), AVERAGE(Size),
MIN(Recipients.Address.Person.Birthdate)
// Multi-level grouping
metric=DISTINCT(Attachments.Extension);
groups=Tags,
Recipients.Address.Person.Department;
query=Attachments.Size > 100000
// Grouping function: TOP
metric=AVERAGE(Size);
groups=TOP(10, Recipients.Address.Email)
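Aggregate queries are likewise issued through the REST API; the Events examples later in this deck pass m= for the metric expression, f= for the grouping expression, and q= for the query expression. A sketch of the TOP(10) example above under the same assumptions (application named Email, query string shown unencoded for readability):
GET /Email/Message/_aggregate?m=AVERAGE(Size)&f=TOP(10,Recipients.Address.Email)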
Doradus Service Architecture
(Diagram: the DoradusServer, configured by doradus.yaml, hosts a set of services. Core services: REST, Schema, TaskManager, MBean, and Tenant. Pluggable storage services: Spider, OLAP, and Logging. Pluggable DB services: Thrift and CQL (for Cassandra) and DynamoDB (for AWS DynamoDB). External interfaces: the REST API and JMX.)
Storage Service Comparison
Feature | Spider | OLAP | Logging
Data variability | Unstructured | Semi-structured | Unstructured
Data mutability | High | Medium | None
Update granularity | Fine | Batch | Batch
Load performance | Medium | High | Very high
Space usage | High | Low | Low
Query focus | Object queries | Aggregate queries | Time-based aggregate queries
Data aging? | yes | yes | yes
Supports links? | yes | yes | no
Dynamic fields? | yes | no | yes
Event load rate/node | 1.3K/second | 124K/second | 444K/second
115M Event DB size | 100GB | 1.89GB | 1.49GB
Doradus OLAP: Motivation
• Combines ideas from:
– Online Analytical Processing: data arranged in static cubes
– Columnar databases: Column-oriented storage and
compression
– NoSQL databases: Sharding
• Features:
– Fast loading: up to 1M objects/second/node
– Dense storage: 1 billion objects in 2 GB
– Fast cube merging: typically seconds
– No indexes
OLAP Data Loading
(Diagram, built up over four slides: Sources such as Events, People, Computers, and Domains feed data into Batches; batches are added to date-named Shards such as 2015-02-27, 2015-02-28, and 2015-03-01 and merged; the merged shards make up the OLAP Store.)
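A rough sketch of this flow over the REST API, assuming an application named Email and a shard named 2015-03-01; the endpoint paths shown here (add a batch with POST /{application}/{shard}, then merge with POST /{application}/_shards/{shard}) are assumptions and should be checked against the Doradus OLAP documentation:
POST /Email/2015-03-01 (body: a batch of new or updated objects in XML or JSON)
POST /Email/_shards/2015-03-01 (merge the shard's accumulated batches into the OLAP store)
Batches can be posted continuously, while merging is typically triggered periodically, for example by the Task Manager's automatic merging.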
OLAP Shard Merging
(Diagram: Batch #1 and Batch #2, each containing Message, Person, and Address tables with their own columns, are both loaded into shard 2014-03-01 and merged into the OLAP Store. After merging, each field of each table is stored under a key of the form application/table/shard/field, with that field's values held as compressed data:)
Key | Columns
Email/Message/2014-03-01/ID | [compressed data]
Email/Message/2014-03-01/Size | [compressed data]
Email/Message/2014-03-01/SendDate | [compressed data]
... | ...
Email/Person/2014-03-01/ID | [compressed data]
Email/Person/2014-03-01/FirstName | [compressed data]
Email/Person/2014-03-01/LastName | [compressed data]
... | ...
Email/Address/2014-03-01/ID | [compressed data]
Email/Address/2014-03-01/Person | [compressed data]
Email/Address/2014-03-01/Message | [compressed data]
... | ...
Email/Message/2014-02-28/ID | [compressed data]
... | ...
Does Merging Take Long?
(Chart: Merge Time vs. Shard Size. X axis: merge time in seconds, 0 to 90; Y axis: shard size in total objects, 0 to 45,000,000.)
OLAP Query Execution
• Example query:
– Count messages with Size between 1000-10000 and
HasBeenSent=false in shards 2014-03-01 to 2014-03-31
• How many rows are read?
– 2 fields x 31 shards = 62 rows
– Each row typically represents millions of objects
• Value arrays are scanned in memory
• Physical rows are read on “cold” start only
– Multiple caching levels create “hot” and “warm” data
pools
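Expressed with the REST parameters used in the Events examples later in this deck, the example query above would look roughly like this (the application name Email is assumed; the HasBeenSent field comes from the example; query shown unencoded for readability):
GET /Email/Message/_aggregate?m=COUNT(*)
&range=2014-03-01,2014-03-31
&q=Size=[1000 TO 10000] AND HasBeenSent=false
Only the compressed Size and HasBeenSent rows of the 31 shards in the range are consulted, which is where the 2 fields x 31 shards = 62 rows figure comes from.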
OLAP Example: Security Events
Sample event in CSV format:
MAILSERVER18,Security,"Sun, 22 Jan 2013 08:09:50 UTC","Success Audit",Security,
"Logon/Logoff",540,"NT AUTHORITY",SYSTEM,S-1-5-18,7,MAILSERVER18$,,Workstation,
"(0x0,0x142999A)",3,Kerberos,Kerberos
Fixed Fields:
Computer Name | MAILSERVER18
Log Name | Security
Time Stamp | Sun, 22 Jan 2013 08:09:50 UTC
Type | Success Audit
Source | Security
Category | Logon/Logoff
Event ID | 540
User Domain | NT AUTHORITY
User Name | SYSTEM
User SID | S-1-5-18
Variable Fields (insertion strings):
1 | MAILSERVER18$
2 | (empty)
3 | Workstation
4 | (0x0,0x142999A)
5 | 3
6 | Kerberos
7 | Kerberos
OLAP Example: Events Schema
Events table (count: 115 million). Fields:
• ComputerName (text)
• LogName (text)
• Timestamp (timestamp)
• Type (text)
• Source (text)
• Category (text)
• EventID (integer)
• UserDomain (text)
• UserSID (text)
• Params (link to Insertion Strings)
Insertion Strings table (count: 880 million). Fields:
• Index (integer)
• Value (text)
• Event (link to Events)
The Params and Event links form the bi-directional relationship between the two tables.
OLAP Example: Events Loading
• Configuration:
– Single Dell PowerEdge™ server with dual Intel® Xeon® CPUs
– Cassandra: Xmx=2G, Doradus: Xmx=8G
• Load stats (15 threads):
Total shards: 860
Total events: 114,572,247
Total ins strings: 879,529,753
Total objects: 994,102,000
Total load time: 15 minutes, 27 seconds
Average load rate: 1.1M objects/second; 124K events/second
• Space usage:
nodetool -h localhost status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Token Rack
UN 127.0.0.1 1.89 GB 100.0% 9fe241b5-20f7-4afb-af92-207ef24b8095 -9176223118562734495 rack1
OLAP Example: Events Queries
• Count all Events in all shards
• REST command:
http://localhost:1123/OLAPEvents/Events/_aggregate?m=COUNT(*)&range=0
• Example response:
<results>
<aggregate metric="COUNT(*)" query="*"/>
<totalobjects>114572247</totalobjects>
<value>114572247</value>
</results>
(URL components: OLAPEvents is the application name, Events is the table name, _aggregate is the aggregate query resource, m=COUNT(*) is the metric function, and range=0 selects all shards.)
OLAP Example: Events Queries
• Find the top 5 hours-of-the-day when certain
privileged events fail:
Event IDs are any of 577, 681, 529
Event type is ‘Failure Audit’
Insertion string 8 is (0x0,0x3E7)
Event occurred in first half of 2005 (181 shards)
GET /OLAPEvents/Events/_aggregate?m=COUNT(*)
&range=2005-01-01,2005-06-30
&q=Type='Failure Audit' AND EventID IN (577,681,529) AND
Params.WHERE(Index=8).Value='(0x0,0x3E7)'
&f=TOP(5,Timestamp.HOUR) AS HourOfDay
“Privileged Event Failures”:
Query Results
<results>
<aggregate metric="COUNT(*)" query="Type='Failure Audit' EventID IN (577,681,529) AND
Params.Index=8 AND Params.Value='(0x0,0x3E7)'" group="TOP(5,Timestamp.HOUR) AS HourOfDay"/>
<totalobjects>591</totalobjects>
<summary>591</summary>
<totalgroups>17</totalgroups>
<groups>
<group>
<metric>119</metric>
<field name="HourOfDay">8</field>
</group>
<group>
<metric>87</metric>
<field name="HourOfDay">18</field>
</group>
<group>
<metric>72</metric>
<field name="HourOfDay">19</field>
</group>
<group>
<metric>69</metric>
<field name="HourOfDay">9</field>
</group>
<group>
<metric>66</metric>
<field name="HourOfDay">5</field>
</group>
</groups>
</results>
Most failures
occur at 08:00
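To retrieve the matching events themselves rather than just the counts, the same selection could be run as an object query. A sketch, assuming an object-query resource of the form /{application}/{table}/_query that accepts the same q= and range= parameters (verify against the Doradus documentation; query shown unencoded):
GET /OLAPEvents/Events/_query?range=2005-01-01,2005-06-30
&q=Type='Failure Audit' AND EventID IN (577,681,529) AND
Params.WHERE(Index=8).Value='(0x0,0x3E7)'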
Doradus OLAP Summary
• Advantages:
– Simple REST API
– All fields are searchable without indexes
– Ad-hoc statistical searches
– Support for graph-based queries
– Near real time data warehousing
– Dense storage = less hardware
– Horizontally scalable when needed
• Good for applications where data:
– Is continuous/streaming
– Is structured to semi-structured
– Can be loaded in batches
– Is partitionable, typically by time
– Is typically queried in a subset of shards
– Emphasizes statistical queries
Thank You!
• Where to find Doradus
– Source and docs: github.com/dell-oss/Doradus
– Downloads: search.maven.org
• Contact me
– Randy.Guck@software.dell.com
– @randyguck
• Thanks to our sponsors!
30 Doradus R136 Cluster
Source: Hubble Space Telescope
PowerEdge is a trademark of Dell Inc.
Intel and Xeon are trademarks of Intel Corporation
in the U.S. and/or other countries.
Editor's Notes
1. This presentation describes Doradus OLAP, which is a query and storage service that extends the Cassandra NoSQL database. The name was derived from 30 Doradus, also known as The Tarantula Nebula, which is a region in the Large Magellanic Cloud. It is a very active and luminous region, serving as the birthplace of new stars and star clusters.
2. Here are the topics covered in this presentation. First we'll describe Doradus at a broad level, including its main features, how it leverages Cassandra, and its data model and query language. We'll then touch on the internal architecture and the concept of storage services. Finally, we'll take a close look at the Doradus OLAP service and how it provides near real time data warehousing. We'll use an example Events application to demonstrate load and storage characteristics and sample queries.
  3. Here is a quick overview of Doradus. From a high-level viewpoint, it is a storage and query service that leverages a NoSQL database for persistence. Doradus primarily uses Cassandra, but integrations with other persistence stores have also been demonstrated. Doradus provides a high-level data model and a query language that supports full text and graph searching features. Applications communicate with Doradus via a REST API that supports JSON and XML messages. Doradus is stateless, so multiple instances can be run against the same Cassandra cluster using either the Thrift or CQL API. Like Cassandra, Doradus is pure Java and runs on Windows and Linux. The architecture diagram shows a minimal implementation consisting of a user application, Doradus instance, and Cassandra instance, all of which can run on a single machine. The Doradus architecture supports pluggable storage services, which organize data for specific application scenarios with different space/performance tradeoffs. One of those services, Doradus OLAP, is the primary focus of this presentation. Because of its dense storage techniques, Doradus OLAP allows many big data applications to run on a single node. However, the Cassandra NoSQL underpinning allows horizontal expansion when needed. Doradus has been bundled with the Dell UCCS Analytics product for about 3 years now and is available as open source under the Apache License 2.0. This means the source code is free for anyone to download, use, and even modify.
  4. Both Doradus and Cassandra can be scaled horizontally. When increased scalability or failover are needed, a multi-node Cassandra cluster can be used. In this example, Cassandra is deployed in a 3-node cluster. Technically, only one Doradus instance is needed: it will rotate requests through all Cassandra nodes and ignore failed nodes automatically. Because Doradus is stateless, you can run additional instances for increased capacity or failover. For simplicity, one approach is to run a Doradus instance on each Cassandra node so that every node is configured identically. From Cassandra’s viewpoint, Doradus is an application. Doradus adds functionality from the “outside” and does not modify core Cassandra code. Doradus has been tested with both Cassandra 2.0 and 2.1 releases.
5. There are many good NoSQL DBs available today, so why did we choose Cassandra? Here are some of the reasons:
Cassandra is the choice of many companies for serious big data applications, including Netflix, eBay, and Apple. At last count, the Netflix cluster uses over 2,800 nodes. Apple reportedly has deployed over 75,000 Cassandra nodes that manage over 10PB of data.
Cassandra's wide-column or tabular data model is extremely flexible. It can be used for simple key/value and document-oriented applications, but it supports up to 2 billion columns per row, and a column value can be up to 2GB in length.
The primary query language for Cassandra is CQL, which borrows from SQL for both DDL (schema) and DML (query) operations. This makes it easier to transition from relational databases.
Cassandra is pure Java and uses a "pure peer" architecture: there is only one process type, and requests can be sent to any node. There are no master/secondary processes, which makes Cassandra easy to deploy and extend.
Cassandra has arguably best-in-class features for horizontal scalability. It is cabinet- and data center-aware, allowing it to choose replication strategies that maximize availability while balancing network usage.
When it comes to managing a distributed cluster, again Cassandra has best-in-class features. New nodes can be added dynamically while the cluster rebalances in the background. Lost nodes can be recovered from replicated data, and Cassandra handles complex problems such as "split brain" failures.
Cassandra is a consistent winner of NoSQL benchmarks.
Most NoSQL DBs are easy to install and get started with. But as data and network complexity grow, Cassandra is arguably the best NoSQL database. This is why many companies choose it and why we use it with Doradus.
6. If Cassandra is so popular, why do we need Doradus? Many applications find that Cassandra provides exactly what they need. But some applications need more than Cassandra's basic features and must add functionality in application code. By design, Cassandra supports limited queries and indexing to encourage applications to follow its partition-based access pattern. Here are some examples of when additional functionality is needed:
Indexing: Cassandra supports minimal indexing. Secondary indexes are hash structures that only support equality searches. Furthermore, they are only recommended for "low cardinality" scalar fields, and some experts don't recommend them at all.
Data model: Cassandra supports single-valued and limited multi-valued scalars. It does not directly support any means for relating data objects.
Queries: By design, CQL does not support joins, subqueries, or aggregation. The most complex select query it supports uses an IN clause. There is no support for full text queries, general inequalities, or statistical queries.
API: Cassandra must be accessed via its Thrift or CQL APIs, and both require a client-side driver. Some applications find this limiting and would prefer to use a REST API.
What if we could leverage Cassandra's best features (scalability, elasticity, recoverability) while elevating the data model and query language? This is the premise on which Doradus was started.
7. Let's look at how Doradus extends Cassandra, starting with the data model and an example Message Tracking application. This schema uses 5 tables to store information about email message traffic. The objects within each table hold scalar field values such as Size and SendDate, but they can also define link fields, which form bi-directional relationships. Every link field has an inverse that defines the same relationship from the opposite direction. For example, the Message table defines a link called Recipients, which points to the Participant table. The inverse link is called MessageAsRecipient, which points back to the Message table. Another pair of links, Sender and MessageAsSender, forms another relationship between the same tables. Doradus maintains inverse links automatically: if Message.Sender is updated, the inverse Participant.MessageAsSender link is automatically added or deleted as needed. The Person table defines a link called Manager, which points to Person with Employees as the inverse link. The Manager/Employees relationship forms an org chart graph that can be searched "up" or "down". A link whose extent (the table it points to) is the same as its owner is called a reflexive link. Reflexive links can be searched recursively via transitive functions. Doradus also allows self-reflexive links that are their own inverse. An example self-reflexive relationship is Friends (though friendship is not always reciprocal!). We'll see how links are used in the query language next.
8. The Doradus Query Language (DQL) can be used in two contexts: An object query selects and returns objects from a specific table, called the perspective. Object queries can return fields from both selected objects and objects linked to them. DQL is based on and extends the Lucene full text language. This means that DQL supports clause types such as terms, phrases, wildcards, ranges, equalities and inequalities, AND/OR/NOT and parentheses, and more. To these full text concepts DQL adds the notion of link paths to traverse relationship graphs. Link paths can use functions such as quantifiers and filters to narrow the selection of objects. Reflexive relationships can use a transitive function to search a graph recursively. DQL queries also support stateless paging and results sorting.
The first example on the right is a simple full text query, fetching objects from the Person table. It consists of three clauses that select objects (1) whose LastName is Smith, (2) whose FirstName does not contain a term that begins with "Jo", and (3) whose BirthDate falls between the years 1986 and 1992 (inclusively). This example uses several features including an equality clause, a term clause, a wildcard term, a range clause, and clause negation.
The second example queries the Message table and demonstrates how link fields are used. Link paths are one of the features that make DQL easy to use. Though deceptively simple, each "." in a link path such as X.Y.Z essentially performs a join. This example shows how link paths are explicitly quantified. When a link path is enclosed in the quantifier ANY, ALL, or NONE, a specific number of the objects in the quantified link path must match the clause's condition. In this example, ALL(Recipients) means that every Recipients value must meet the remainder of the clause. The Address link is quantified with ANY, and it is also filtered with a WHERE clause. This means that only Address objects whose Email ends with "gmail.com" are selected, and at least one of them (ANY) must meet the remainder of the clause. Without the quantifiers and filter, the link path is Recipients.Address.Person.Department, and it uses a contains clause searching for the term support. This means the Department field must contain the term "support" (case-insensitive). Textually, this query selects objects where all Recipients have at least one Address whose Email ends with "gmail.com" and whose linked Person's department is in support.
The last example demonstrates transitive searches using the Person table. Because Employees is a reflexive link, we can search it recursively through the graph, in this case "down" the org chart. The caret (^) is the transitive function and causes the preceding link, Employees, to be searched recursively. The optional value in parentheses after the caret limits the recursion to a maximum number of levels. In this case, ^(4) says "recurse the link to no more than 4 levels". Without the limit parameter, the transitive function searches the graph until a cycle or a "leaf" object is found.
9. The other context in which DQL can be used is aggregate queries, which select objects from a perspective table and perform statistical computations across those objects. This slide shows several example aggregate queries. These examples demonstrate the metric expression, grouping expression, and query expression components of aggregate queries. All examples use the Message table as the perspective.
The first example performs three computations across selected objects: (1) a COUNT of all objects, (2) the AVERAGE value of the Size field, and (3) the smallest (MIN) Birthdate found among objects linked via Recipients.Address.Person. All three statistical computations are made in a single pass through the data. Since this example contains no grouping expression, a single value is returned for each of these computations.
The second example demonstrates multi-level grouping, which divides objects into groups and performs the metric computations across each subgroup. In this example, a single metric function is computed: the unique (DISTINCT) Attachments.Extension values within each group. The grouping expression groups objects first by their Tags field and secondarily by the values found in the link path Recipients.Address.Person.Department. Because this example has a query expression, only those objects matching the selection expression are included in the aggregate computation.
The third example computes AVERAGE Size for all objects, grouped by a single-level grouping expression Recipients.Address.Email. The grouping expression is enclosed in a TOP(10) function, which means that only the groups with the top 10 metric values (average Size) are returned.
Doradus supports a wide range of grouping functions to create aggregate query groups. Some of the more popular functions are summarized below:
BATCH: creates groups from explicit value ranges
TOP/BOTTOM: returns a limited number of groups based on their highest/lowest metric values
FIRST/LAST: returns a limited number of groups based on their highest/lowest alphabetical group names
INCLUDE/EXCLUDE: includes or excludes specific group values within a grouping level
UPPER/LOWER: creates groups with case-insensitive text values
SETS: creates groups from arbitrary query expressions
TERMS: creates one group per term within a text field
TRUNCATE: creates groups from a timestamp value rounded down to a specific date/time granularity
See the Doradus documentation for a full description of all aggregate query grouping functions.
  10. Here we look at the Doradus internal architecture. The major internal building blocks are called services. Some services such as the Schema service are required, whereas some services such as Task Manager and REST services can be disabled when not needed. This allows “skinny” execution for embedded scenarios such as bulk load applications. Services are controlled via parameters specified in the doradus.yaml configuration file. All parameters can also be controlled at runtime via command-line arguments. The core DoradusServer component handles the startup and shutdown of all services. Two service types are abstract, allowing configurable, concrete services to be dynamically selected. These pluggable service types are: Storage: Storage services are responsible for mapping update and query requests to low-level rows and columns. That is, storage services organize data using techniques that are optimal for specific application scenarios. Three primary storage services are currently available: Spider, OLAP, and Logging, which are discussed in the next slide. But we are always experimenting with new services. Multiple storage services can be used at the same time in a single Doradus instance. DB: The DB service maps row and column read/write requests to a specific underlying persistence API. For access to Cassandra, both Thrift and CQL services are available: Thrift is older but more performant; CQL has more advanced failover features. Very recently, a DynamoDB implementation was added to allow data to be persisted using Amazon Web Services (AWS) DynamoDB. Experimentally, we have also created DB service implementations that store data in flat files, SQL databases, and memory-only storage. In a given Doradus instance, only a single DB service type can be used. Below is a summary of the other services currently used by Doradus: REST: This service processes REST API requests using the Java servlet API. A Jetty server is bundled with Doradus and is the default web server, but Doradus can be hosted in another servlet-based web server such as Tomcat. The REST service is optional and can be disabled for embedded applications. Schema: This service is responsible for processing schema requests such as initializing new schemas (called applications) and deleting existing ones. This service is required. Task Manager: This service manages background tasks such as data aging and automatic merging (for OLAP). In a multi-instance cluster, tasks are distributed among all Doradus instances. The Task Manager can be disabled for embedded applications. MBean: This service collects statistics on REST command usage and other metrics and makes them available via JMX. This service can be disabled when these metrics are not needed. Tenant: This service provides multi-tenant features. It is a required service.
11. This slide highlights the feature differences between the Doradus storage services. Doradus Spider is the first storage service we developed. It is still the most flexible service because it supports highly mutable data with fine-grained updates. Spider stores each object in a separate row and adds customizable indexing for each field. Its focus is object queries that rely on full text search features. Although Spider is the most flexible service, it is intended for moderate-sized databases (millions of objects) and moderate load rates (thousands to tens of thousands of objects per second). Doradus OLAP is our workhorse and most mature storage service. It uses columnar storage and other techniques to compact data from many objects into a single row. OLAP has more restrictions than Spider: for example, all data fields must be pre-defined and data must be loaded in batches. However, OLAP yields very fast loading, high performance analytic queries, and dense storage. Doradus Logging is our newest storage service, intended for time-series log data. The Logging service has more restrictions than OLAP: for example, data cannot be modified once added, and link fields are not supported. However, the Logging service yields even faster loading than OLAP, slightly denser storage, and fast time-based analytic queries. This slide compares the load and storage characteristics of the three storage services using a common data set consisting of 115 million events: Spider takes 24 hours to load this data and requires over 100GB of storage. OLAP loads the same data in only 15 minutes and creates a database that uses less than 2GB of space. Logging loads the same data in only 5 minutes and uses even less space.
12. Now let’s look at Doradus OLAP more closely, starting with what motivated it. After we created Doradus Spider, our primary client application changed focus from full text object queries to statistical queries. The variability of the search criteria meant that traditional indexing would not be sufficient: we needed to scan millions of objects per second, which was not possible using traditional storage techniques. Some out-of-the-box thinking resulted in Doradus OLAP, which borrows from several disciplines:
Online Analytical Processing: OLAP is widely used in data warehouses, where data is transformed and stored in n-dimensional cubes.
Columnar databases: These databases store column values, rather than row values, in the same physical record.
NoSQL databases: NoSQL DBs commonly use sharding to co-locate data, especially time-series data.
Doradus OLAP combines ideas from these areas to provide a new storage service that is suitable for certain kinds of applications. Compared to the Spider storage service, OLAP imposes some restrictions, but for applications that can use it, OLAP offers many advantages, including:
Fast data loading: We’ve observed loads of up to 1M objects/second on a single node in bulk-load scenarios.
Dense storage: Data is stored in a way that makes it highly compressible.
Fast merging: One of the typical drawbacks of data warehouses is the time required to update the database with new data. With Doradus OLAP, data is stored in cubes that can be updated quickly, often in a few seconds.
To accomplish this, Doradus OLAP uses no indexes. Note that OLAP is not an in-memory database: data can be orders of magnitude larger than available memory.
13. Let’s look at how data loading works with Doradus OLAP. The data loading sequence also highlights the criteria that applications must meet to load data effectively. As with most database applications, there may be multiple data sources, each with its own “velocity”: some data may be generated quickly, perhaps continuously, whereas other data may change infrequently. Events, for example, may be collected in a continuous stream, whereas information about People may be collected from a directory server via a daily snapshot. It is up to the application to decide how often data is collected and loaded.
14. Doradus OLAP requires data to be loaded in batches. A single batch can mix new, modified, and deleted objects, and a batch can contain partial updates such as adding or modifying a single field. Each batch can update data in multiple tables. The ideal batch size depends on many factors, including how many fields are updated, but tests show that good batch sizes range from a few hundred up to many thousands of objects.
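As an illustration of that batching guidance, below is a conceptual Java sketch (not Doradus source code) of a client-side accumulator that collects objects and flushes once a target batch size is reached. The actual POST of the batch to a shard is left as a comment because the exact payload format is described in the Doradus documentation rather than here.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchAccumulator {
    private static final int BATCH_SIZE = 1000;   // "a few hundred up to many thousands"
    private final List<Map<String, Object>> pending = new ArrayList<>();

    void add(Map<String, Object> object) {
        pending.add(object);
        if (pending.size() >= BATCH_SIZE) flush();
    }

    void flush() {
        if (pending.isEmpty()) return;
        // Here the accumulated objects would be serialized and POSTed to the
        // target shard's load endpoint via the Doradus REST API.
        System.out.println("Flushing batch of " + pending.size() + " objects");
        pending.clear();
    }

    public static void main(String[] args) {
        BatchAccumulator batcher = new BatchAccumulator();
        for (int i = 0; i < 2500; i++) {
            Map<String, Object> event = new HashMap<>();
            event.put("EventID", 540);
            batcher.add(event);
        }
        batcher.flush();   // flush the final partial batch
    }
}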
15. As each batch is loaded, it identifies the shard to which it belongs. A shard is a partition of the database and can be visualized as a cube; it is typically a snapshot of the database for a specific time period. The most common shard granularity is one day, though finer or coarser granularities will be better for some applications. Each shard has a name. The name is not meaningful to Doradus, but shard names are considered ordered. So, if you want to query a specific range of day shards, say March 1st through March 31st, a good idea is to name shards using the format YYYY-MM-DD. Then you can query the shard range 2015-03-01 to 2015-03-31 and the appropriate shards are selected. When a batch is loaded, it is not immediately visible in the shard, which means its data is not yet returned by queries. Periodically, the shard must be merged, which causes its pending batches to be applied. You can think of a shard’s queryable data as the “live cube”: merging applies updates from pending batches, creating a new live cube. After merging, the update batches are deleted. Merging can be requested manually or performed automatically in the background on a schedule. The frequency with which shards are merged determines the latency of the data: if shards are merged once per hour, queries will reflect data that is up to one hour old. Merging more often yields fresher data but consumes more resources.
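A small sketch of the YYYY-MM-DD naming convention just described, assuming one-day shards and UTC timestamps (the time zone choice is an assumption, not something prescribed by Doradus):

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class ShardNaming {
    private static final DateTimeFormatter DAY =
        DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

    // "2015-03-01" sorts before "2015-03-31", so a lexical shard-range query for
    // March 2015 selects exactly the March day shards.
    static String shardFor(Instant eventTime) {
        return DAY.format(eventTime);
    }

    public static void main(String[] args) {
        System.out.println(shardFor(Instant.parse("2015-03-15T08:00:00Z")));   // prints 2015-03-15
    }
}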
16. As each shard is updated and merged, a “cube” is added to the OLAP store, which is a Cassandra ColumnFamily. A typical OLAP application is expected to have hundreds to thousands of shards. Internally, a shard consists of arrays that are designed for fast merging and fast loading during queries. This is a critical component of Doradus OLAP, so let’s take a closer look.
17. When batches are loaded, data is extracted and stored in minimally-sized arrays. For example, if an integer field is found to have a maximum value of 50,000, then a 2-byte array is used. Boolean fields use a bit array. Text values are de-duped and stored case-insensitively. Each value array is sorted in object ID order and then stored as a compressed row in Cassandra. Because each array contains homogeneous data (e.g., all integers), it compresses extremely well, often 90% or better. Rows that are too small to warrant compression are stored uncompressed, and multiple small rows are combined into a single larger row so that all of their values can be loaded with one read. The idea is to use as little physical disk space as possible so that a large number of values can be loaded at once. When a shard is merged, all pending updates are combined with the shard’s current live data to create a new live data set. Since batch loading generates sorted arrays, the merge process consists of heap-merging all arrays of the same table and field into a new array. Heap merging is very fast, which is why merging takes so little time.
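To illustrate why the merge step is cheap, here is a conceptual k-way heap merge of sorted integer arrays, the same technique described above. This is an illustration only, not Doradus source code.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class HeapMerge {
    // Merge several individually sorted arrays into one sorted array in a single pass.
    static int[] merge(List<int[]> sortedArrays) {
        // Heap entries: {value, sourceArrayIndex, positionInSourceArray}
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        int total = 0;
        for (int i = 0; i < sortedArrays.size(); i++) {
            int[] arr = sortedArrays.get(i);
            total += arr.length;
            if (arr.length > 0) heap.add(new int[] {arr[0], i, 0});
        }
        int[] merged = new int[total];
        int out = 0;
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            merged[out++] = top[0];
            int[] src = sortedArrays.get(top[1]);
            int next = top[2] + 1;
            if (next < src.length) heap.add(new int[] {src[next], top[1], next});
        }
        return merged;
    }

    public static void main(String[] args) {
        List<int[]> batches = new ArrayList<>();
        batches.add(new int[] {1, 4, 9});     // live data
        batches.add(new int[] {2, 3, 10});    // pending batch
        System.out.println(Arrays.toString(merge(batches)));   // [1, 2, 3, 4, 9, 10]
    }
}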
18. In traditional data warehouse technology, the extract-transform-load (ETL) process is typically very lengthy, sometimes many hours. Since a Doradus OLAP shard is intended to hold millions of objects, you might therefore assume that the merge process takes a long time. However, the merge process is designed to be fast, often only a few seconds. This graph shows the merge times for an event tracking application (described later). In this load, 860 shards of varying size were loaded. For each shard containing 100,000 objects or more, the merge time is plotted against the number of objects in the shard. As the graph shows, merge time is directly proportional to the number of objects in the shard. For this application, the longest time to merge a shard was just over 80 seconds; that shard contained over 37 million objects. (Keep in mind that merge time is also affected by the number of tables and fields being merged, so your mileage may vary.) Quick merge times mean that Doradus OLAP allows data to be added and merged fairly often, allowing queries to access data that is close to real time.
19. Since Doradus OLAP uses no indexes, how are queries efficient? This slide shows how OLAP arrays are accessed during queries. The query searches the Message table and counts objects whose Size field falls between 1,000 and 10,000 and whose HasBeenSent flag is false. Furthermore, the query searches shards whose names fall between 2014-03-01 and 2014-03-31. Assuming we used one-day shards, this requires searching 31 shards. Since the query accesses two fields, OLAP must read 62 rows: the value array for each field in each shard. This might sound like a lot of reads for a single query, but remember that shards typically contain a large number of objects. If our shards averaged 1 million objects each, this query scans 31 million objects with only 62 reads! As each array is read, it is decompressed and scanned in memory. Value arrays are designed for fast scanning: modern processors can typically scan tens of millions of values per second. When a value array is scanned, an “object bit array” is generated to reflect the selected objects. Each additional value array scan turns bits on or off; the final bit array represents the result of the query. On a “cold” system, where nothing is cached, each row read requires a physical read. In practice, however, data is cached at multiple levels:
The operating system caches Cassandra data files (SSTables). Since the data is compressed, a large amount of it is typically cached at this level.
Cassandra caches information such as recently-read rows, key indices, and bloom filters to speed up access to recently-accessed data.
Doradus OLAP caches value arrays on a most-recently-used (MRU) basis, so recently-accessed values are not re-read from Cassandra.
Since query results consist of compact bit arrays, these are also cached at the clause level on an MRU basis. Hence, recently repeated query clauses are reused and act as cached queries.
These caching levels produce natural “hot” and “warm” data pools that speed up access to the most frequently requested data.
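The object bit array evaluation described above can be sketched in a few lines of Java. This is a conceptual illustration with toy data, not Doradus source code: one pass per clause over a value array sets a bit for each matching object, and the per-clause bit sets are ANDed to produce the final result for a shard.

import java.util.BitSet;

public class BitArrayScan {
    public static void main(String[] args) {
        // One entry per object in the shard, in object ID order (toy data).
        int[] sizes = {500, 2500, 9000, 5000, 1200};
        boolean[] hasBeenSent = {false, true, false, false, true};

        BitSet sizeMatch = new BitSet(sizes.length);
        for (int i = 0; i < sizes.length; i++) {
            if (sizes[i] >= 1000 && sizes[i] <= 10000) sizeMatch.set(i);   // Size in [1000 TO 10000]
        }

        BitSet notSent = new BitSet(hasBeenSent.length);
        for (int i = 0; i < hasBeenSent.length; i++) {
            if (!hasBeenSent[i]) notSent.set(i);                           // HasBeenSent = false
        }

        sizeMatch.and(notSent);                         // combine the two clauses
        System.out.println(sizeMatch.cardinality());    // this shard's contribution to COUNT(*): 2
    }
}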
20. Let’s look at a real-world Doradus OLAP application: a Windows event tracking application. Shown is an example source event in CSV format. Each event consists of 10 fixed fields and 0 or more variable fields, called “insertion strings”. The number of insertion strings depends on the event’s ID. In this example, a 540 event has 7 insertion strings, one of which is null. The index of each insertion string is significant: we must store the index of each insertion string as well as its value, since each position is meaningful for the corresponding event.
21. Doradus OLAP requires predefined tables and fields. However, although our event data is variable in format, we can capture all event fields with a two-table schema: Events stores the 10 fixed fields of each event, and InsertionStrings stores the index and value of each variable field. The Events.Params link field and its inverse, InsertionStrings.Event, connect each event to its insertion string objects. This allows us to find events and navigate to their insertion strings, or vice versa. For this test, we loaded just under 115 million events. Since events average around 7.7 insertion strings each, this requires the creation of about 880 million insertion string objects. Each event object is connected to its insertion string objects by populating the Params/Event links.
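To show how one CSV event maps onto this two-table model, here is a conceptual sketch (not the actual load code) that splits a record into its fixed fields and its indexed insertion strings. The record layout is simplified to three fixed fields instead of ten, and empty insertion strings are skipped; both simplifications are assumptions made only for this illustration.

import java.util.ArrayList;
import java.util.List;

public class EventMapping {
    record InsertionString(String eventId, int index, String value) {}

    public static void main(String[] args) {
        // Toy record: 3 fixed fields followed by 4 insertion strings, one of them null/empty.
        String[] csv = "540,2015-03-01 08:15:00,HOST01,DOMAIN/user,,Logon,Process".split(",", -1);
        int fixedFieldCount = 3;
        String eventId = "evt-1001";   // ID of the Events object this record becomes

        List<InsertionString> strings = new ArrayList<>();
        for (int i = fixedFieldCount; i < csv.length; i++) {
            if (!csv[i].isEmpty()) {   // empty insertion strings are skipped in this sketch
                strings.add(new InsertionString(eventId, i - fixedFieldCount, csv[i]));
            }
        }
        // Each InsertionString object would be linked back to its event via the
        // Params/Event link pair when the load batch is built.
        strings.forEach(System.out::println);
    }
}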
22. A standard benchmark we use is loading the 115M Events data set on a single node running both Doradus and Cassandra. The server is a Dell PowerEdge™ server with dual Intel® Xeon® CPUs and 48GB of memory. We configure Cassandra to use a maximum of 2GB of memory, and we configure Doradus, which runs embedded in the load application, to use a maximum of 8GB of memory. The load app uses 15 threads to load data in parallel. The Events data set spans 860 calendar days, and we load the data using one-day shards, which creates 860 shards. The bulk load application creates a total of over 994 million objects. On a single server, this data loads in about 15 minutes, which works out to an average load rate of 124K events per second, or 1.1M total objects per second (Events + InsertionStrings). When the Cassandra database is flushed and compacted, the “nodetool status” command reports that the entire database takes 1.89GB. In another test, we compared the space used by Doradus to the space used by a relational database for an auditing application. The SQL Server database required 8.5KB per object, whereas Doradus required only 87 bytes, a savings of almost 99%! Doradus OLAP’s space savings result from columnar storage, compression, de-duping, and the lack of indexes.
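As a rough cross-check of those rates (a back-of-the-envelope calculation, not an additional measurement): 115,000,000 events at 124,000 events per second is about 927 seconds, a little over 15 minutes, and 124,000 events per second times roughly 8.7 objects per event (one Events object plus about 7.7 InsertionStrings objects) is about 1.08 million objects per second, consistent with the 1.1M figure above.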
23. Here is a simple DQL query that uses our events application. Because Doradus provides a REST API, we can use a browser to query the database. The full REST URL (which assumes Doradus is running locally) is shown. The significant parts of the URL are highlighted:
The first node in the URI is the name of the application containing the data to be queried.
The second node is the table being queried.
The third node is the system resource name for aggregate queries, _aggregate. (System resources begin with an underscore.)
As a query parameter of the URI, the m parameter defines the metric function to be computed; COUNT(*) simply counts all objects.
The range parameter defines the range of shards to be queried; range=0 is shorthand for “all shards”.
This query counts all events in all 860 shards. It takes less than 10 seconds on a “cold” system and less than 1 second on a warm one.
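For readers without the slide image, such a URL might look roughly like the following; the application name EventsApp and the local port are assumptions, and the exact URL appears on the slide itself:

http://localhost:1123/EventsApp/Events/_aggregate?m=COUNT(*)&range=0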
24. This example performs a typical analytical query. Suppose we’re trying to see whether there is a pattern in when certain privileged events fail; that is, we want to know whether they mostly fail at the same time of day. The query looks for events where (1) the EventID belongs to a specific set of privileged event numbers, (2) the event Type field denotes a failure, and (3) the result code from the failed Windows operation (Index=8) has a specific event code (Value=‘(0x0,0x37)’). Since we are querying a six-month range, this query will search 181 shards. Shown is the REST command for this query with the URI encoding details removed:
The metric parameter (m) is COUNT(*).
The shards are selected using a range parameter that specifies the beginning and ending shard names.
The query expression (q) uses three clauses that specify the selection criteria. Notice that objects are selected in part based on having an object linked via the Params field whose Index is 8 and whose Value is the hex code we’re looking for.
The grouping parameter (f) is the HOUR component of the Timestamp field. Only the TOP 5 groups are returned, and the grouping expression is renamed HourOfDay for easier parsing of the query results, which are shown on the next slide.
By the way, these results can be returned in JSON by appending “&format=json” to the query.
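For readers without the slide image, a rough reconstruction of that command is sketched below. The application name, the specific list of privileged EventIDs, the Type value, the date range, and the exact spelling of the range and grouping parameters are placeholders inferred from the description above, not the literal command shown on the slide:

http://localhost:1123/EventsApp/Events/_aggregate?
    m=COUNT(*)
    &range=2014-01-01,2014-06-30
    &q=EventID IN (576,577,578) AND Type='Failure Audit' AND Params.WHERE(Index=8).Value='(0x0,0x37)'
    &f=TOP(5, HOUR(Timestamp)) AS HourOfDay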
25. Shown is a typical response to our “privileged event failures” query. The elements of this query result are:
aggregate: echoes the parameters used to create the query.
totalobjects: shows the total number of objects selected by the query.
summary: a “rolled up” value across all groups. For this query, summary is the same value as totalobjects because the metric function is COUNT(*).
totalgroups: indicates that the data fell into a total of 17 groups. Since the grouping field was hour-of-day, this means 17 separate hours had occurrences of the requested privileged event failures.
groups: this outer element contains one child element per returned group. Because the grouping expression requested TOP(5), only the top 5 groups (those with the highest metric values) were returned.
group: for each group, the metric value is returned along with the group’s identity, which is labeled HourOfDay because of the AS clause.
The results show that most of these failed events occur around 08:00, probably because people are logging into their computers at that time.
26. To summarize, the primary advantages of Doradus OLAP are:
The REST API is easy to use from all languages and platforms without requiring a specialized client library.
All fields are searchable without specialized indexes. The lack of indexes makes Doradus OLAP ideal for ad-hoc statistical queries.
DQL extends Lucene’s full text query language with graph-based query features that are much simpler than joins.
Fast data loading and shard merging allow an OLAP database to provide near real time data warehousing.
Because data is stored very compactly, less disk space is required, saving up to 99% compared to other databases. Combined with fast in-memory query processing, this means less hardware is required compared to other big data approaches. A single node will be sufficient for many analytics applications, but when necessary, the database can be scaled horizontally using Cassandra’s replication and sharding features.
Doradus OLAP works best for applications where:
Data is received as a continuous stream: events, log records, transactions, etc.
Data is structured (all fields are predefined) or semi-structured (variability can be accommodated, as in the Events application in this presentation).
Data can be accumulated and loaded in batches.
Data can be partitioned into shards. Time-series data works best, but strictly speaking other partitioning criteria can work.
Data is typically queried within a subset of shards. For time-sharded data, this means queries select specific time ranges.
The application primarily uses statistical queries. Full text queries are supported, but statistical queries are Doradus OLAP’s forte.
27. The Dell Software Group uses many open source software projects, and we’re pleased we can contribute back to the open source community. Doradus is open source under the Apache License 2.0, so everyone is free to download, modify, and use the code. Full source code and documentation are available from GitHub, and binary, source code, and Javadoc bundles are available from Maven Central. Comments and suggestions are more than welcome, so feel free to contact me. I would also like to thank the sponsors of this O’Reilly webcast, Dell and Intel, whose support makes open source software and free webcasts such as this possible. Thank you! - Randy Guck