Overview of Cassandra and
The Doradus OSS Project
Randy Guck
Principal Engineer, Dell Software
Overview
•  What is NoSQL?
– Common RDB roadblocks
– NoSQL database types
•  Overview of Cassandra
– What's unique
– Limitations
•  Doradus
– Architecture
– Features
– The OLAP and Spider storage managers
– What each is good for
– Where to get Doradus
Why RDB Apps Look for Something Else
•  Performance
– B-trees
– Locking
– One writable copy of each record
•  Scaling costs
– RDBs scale "up"
– Big boxes, SANs, Fibre Channel, etc.
•  What if you want...
– Distributed access
– No single points of failure
– Instant failover
– Sharding
– Replication
NoSQL Data Models
Data Model             Examples                             Elastic?  Queries?  Relationships?
Key–Value              LevelDB, Kyoto Cabinet, Redis        No        No        No
Distributed Key–Value  Dynamo, MemcacheDB, Riak, Voldemort  Yes       No        No
Column-Oriented        Accumulo, Cassandra, HBase           Yes       Some      No
Document-Oriented      Couchbase, Elasticsearch, MongoDB    Yes       Yes       Some
Graph                  Neo4J, OrientDB, Titan               No        Yes       Yes
Elastic? = sharding + replication
Queries? = AND/OR/ranges/etc.
Relationships? = built-in support
Doradus goals
NoSQL Common Traits
•  Distributed cluster of nodes
– Commodity, shared-nothing servers
– Scales horizontally
– Expands elastically
•  Replication
– Performant local access
– Automatic failover
•  De-normalized data model
•  Schemaless/dynamic columns
•  Eventual consistency
N=5, RF=3
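The N=5, RF=3 label can be read as five nodes with every record stored on three of them. The sketch below shows one simple way such a placement could work, assuming a ring where a key hashes to a starting node and the next RF-1 nodes clockwise hold the other copies. The function name `replicas_for` and the MD5-based hash are illustrative assumptions, not Cassandra's actual partitioner or replication strategy.

```python
import hashlib


def replicas_for(key: str, n_nodes: int = 5, rf: int = 3) -> list:
    """Return the node indexes that would hold copies of `key`."""
    # Hash the key to pick the first replica (illustrative hash only).
    digest = hashlib.md5(key.encode("utf-8")).digest()
    first = int.from_bytes(digest[:8], "big") % n_nodes
    # Walk clockwise around the ring for the remaining copies.
    return [(first + i) % n_nodes for i in range(rf)]


if __name__ == "__main__":
    for key in ("62c36", "837a2", "2de83"):
        print(key, "->", replicas_for(key))
```

With RF=3, any single node can fail and two copies of each record remain reachable, which is what makes automatic failover possible.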
Is NoSQL Catching On?
Source: db-engines.com
Overview of Cassandra
•  Wide column NoSQL database
•  Open sourced by Facebook
•  Apache Project with active community
•  Commercially supported by DataStax,
Acunu, others
•  Used by 1,500+ companies
•  "Pure peer" architecture
•  Largest known Cassandra cluster:
300+ TB data and 400+ machines.
What is Cassandra best for?
•  Continuous data streams
– Logs, events, audit records, measurements, ...
– Fast data ingestion
– Predictable read performance
•  Partitionable data
– "1,000s of little databases in one"
•  Elastic scalability
– Expand/upgrade/repair without downtime
•  Not good for:
– Blob store
– Persistent queue
– OLTP transactions
CQL Static Table
CREATE TABLE songs (
    id     uuid PRIMARY KEY,
    title  text,
    album  text,
    artist text,
    data   blob
);
CREATE INDEX ON songs (artist);

Row Key    Columns: "<column name>"="<column value>"
62c36...   "album"="90125"  "artist"="Yes"  "data"=<audio>  "title"="Changes"
837a2...   "album"="Crystal Ball"  "artist"="Styx"  "data"=<audio>  "title"="Put Me On"
2de83...   "album"="Nevermind"  "artist"="Nirvana"  "data"=<audio>  "title"="Breed"
...
CQL Clustered Table
CREATE TABLE playlists (
    id         uuid,
    song_order int,
    song_id    uuid,  // copied from songs.id
    title      text,  // copied from songs.title
    album      text,  // copied from songs.album
    artist     text,  // copied from songs.artist
    PRIMARY KEY (id, song_order)  // compound key
);

Row Key    Columns: "<song_order>:<column name>"="<column value>"
28d23...   "1:"=""  "1:album"="90125"  "1:artist"="Yes"  "1:song_id"="62c36..."
           "1:title"="Changes"  "2:"=""  "2:album"="Nevermind"  "2:artist"="Nirvana"
           "2:song_id"="2de83..."  "2:title"="Breed"  "3:"=""  ...
2ed91...   "1:"=""  "1:album"="Crystal Ball"  "1:artist"="Styx"  "1:song_id"="837a2..."
           "1:title"="Put Me On"  "2:"=""  ...
...
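
The column layout shown for the playlists table can be mimicked in a few lines. This is a minimal sketch that only reproduces the naming pattern from the slide ("<song_order>:<column name>", plus an empty marker column per song_order); the function `to_physical_row` and the sample dictionaries are hypothetical and do not represent Cassandra's storage engine code.

```python
def to_physical_row(cql_rows):
    """Flatten CQL rows that share one row key into wide-row columns."""
    columns = {}
    for row in sorted(cql_rows, key=lambda r: r["song_order"]):
        prefix = f'{row["song_order"]}:'
        # Marker column with an empty value, as in "1:"="" on the slide.
        columns[prefix] = ""
        for name in sorted(row):
            if name != "song_order":
                columns[prefix + name] = row[name]
    return columns


# Two CQL rows belonging to the playlist with row key 28d23...
playlist = [
    {"song_order": 2, "song_id": "2de83...", "title": "Breed",
     "album": "Nevermind", "artist": "Nirvana"},
    {"song_order": 1, "song_id": "62c36...", "title": "Changes",
     "album": "90125", "artist": "Yes"},
]

if __name__ == "__main__":
    for name, value in to_physical_row(playlist).items():
        print(f'"{name}"="{value}"')
```

Because the clustering value is the column-name prefix, all columns of one song sort together and songs come back in song_order.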
  
CQL Clustered Table (cont.)
•  CQL "Rows"
– Each group of columns that shares a song_order prefix ("1:", "1:album", "1:artist", "1:song_id", "1:title") is one CQL row
– One physical row (e.g., row key 28d23...) holds many CQL rows of the playlists table
Can we make Cassandra more appealing?
•  Data Model
– No direct support for relationships
•  Indexing
– Secondary indexes: single column only
– Hash table only: no range searching
•  Searching
– No joins or embedded queries
– No aggregate queries
– Limited equalities (e.g., SELECT * WHERE <key> IN (<list>))
– No full text search
– No OR clauses
– ...
What is Doradus?
•  Java service that enhances Cassandra
•  Adds features:
– REST API (JSON and XML)
– Multi-tenancy
– Graph model
– Multi-field/full text query language
– Automatic data aging
– OLAP and Spider storage services
•  Compatible with NoSQL tenets such as idempotent
updates
•  Under development for ~3 years
•  Open source: Apache 2.0 License
Doradus Graph Model
•  A cluster hosts one or more applications
•  An application owns tables which store objects
•  An object consists of single- and multi-valued fields
•  A pair of link fields form a bi-directional relationship
Tables and fields (from the diagram):
– Message {Size, SendDate}
– Participant {ReceiptDate}
– Address {Name}
– Person {Name, Department}
– Attachment {Size, Extension}
Link pairs (from the diagram):
– Manager → / ← Employees
– ↓ Person / Address ↑
– ↓ Attachments / Message ↑
– Recipients → / ← MessageAsRecipient
– Address → / ← Participants
– Sender → / ← MessageAsSender
Example Object and Aggregate Queries
•  Lucene full text query
GET /Email/Person/_query?q=FirstName:j* AND NOT Office:[q TO z]
•  Link path with filtering
GET /Email/Message/_query?q=
    Sender.WHERE(ReceiptDate>'2010-06-01').Address.Name="*.com"
•  Quantifiers
GET /Email/Message/_aggregate?m=COUNT(*)
    &q=ANY(Recipients).ALL(Address).NONE(Person).Department:sales
    &f=Tags,TOP(3,TRUNCATE(SendDate,DAY))
•  Transitive links
GET /Email/Person/_query?q=DirectReports^(3).LastName=wilson
    &f=DirectReports(Name,DirectReports(Name))
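
The queries above are shown unencoded for readability; a client would normally percent-encode the parameter values before sending them. A minimal sketch using only the Python standard library is shown here. The `BASE_URL` host and port and the helper name `query_uri` are assumptions for illustration; only the path layout (/application/table/command) and the parameter names q, m, and f come from the examples.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:1123"  # assumed host and port, adjust as needed


def query_uri(application, table, command, **params):
    """Build an encoded URI such as /Email/Person/_query?q=..."""
    path = "/".join(quote(part) for part in (application, table, command))
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return f"{BASE_URL}/{path}?{urlencode(params, quote_via=quote)}"


if __name__ == "__main__":
    print(query_uri("Email", "Person", "_query",
                    q="FirstName:j* AND NOT Office:[q TO z]"))
    print(query_uri("Email", "Message", "_aggregate",
                    m="COUNT(*)",
                    q="ANY(Recipients).ALL(Address).NONE(Person).Department:sales",
                    f="Tags,TOP(3,TRUNCATE(SendDate,DAY))"))
```

The resulting string can be passed to any HTTP client as the target of a GET request.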
  
	
  
Doradus: Architecture
•  Application calls Doradus through the REST API
•  Doradus calls Cassandra through Thrift or CQL
•  Cassandra stores data and log files
Doradus: Multi-Data Center Clusters
•  Six nodes (Node 1 to Node 6), each running Cassandra; four of them also run Doradus
•  Nodes are split between Rack 1, Data Center 1 and Rack 1, Data Center 2
•  Applications connect in each data center
•  DC=2, N=6, RF=3
Doradus: Internal Architecture
•  Clients: apps (REST) and a monitor app (JMX)
•  REST: embedded Jetty server
•  Storage services: Spider and OLAP
•  Cassandra interface to the Cassandra cluster
•  Configuration: doradus.yaml
Doradus OLAP Service
•  Borrows from online analytical processing
– Sharding as data "cubes"
– Columnar storage
•  Very dense storage
– No indexes!
– Value arrays are compressed
•  Fast load time
– Up to 500,000 objects/second/node
– Small "data lag" time
•  Very fast queries
– Searches millions of objects/second
– Full DQL object and aggregate query support
OLAP Data Loading
•  Sources: Events, People, Computers, Domains
•  Segments: T1, T2, T3, T4, ...
– Changes in last n minutes
•  Shards: 2013-03-01, 2013-02-28, 2013-02-27, ...
– Date-based shards
•  OLAP Store
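
The date-based sharding step can be illustrated with a short sketch. It assumes each event is a dictionary with a `timestamp` field and that a shard is named after the event's calendar day, as in the slide's 2013-03-01 style names. The helper names `shard_name` and `group_into_shards` are hypothetical and this is not the Doradus loader.

```python
from collections import defaultdict
from datetime import datetime


def shard_name(timestamp: datetime) -> str:
    """Name a shard after the calendar day, e.g. 2013-03-01."""
    return timestamp.strftime("%Y-%m-%d")


def group_into_shards(events):
    """Group a batch of events into one list per date-based shard."""
    shards = defaultdict(list)
    for event in events:
        shards[shard_name(event["timestamp"])].append(event)
    return dict(shards)


if __name__ == "__main__":
    batch = [
        {"id": 1, "timestamp": datetime(2013, 3, 1, 9, 30)},
        {"id": 2, "timestamp": datetime(2013, 2, 28, 23, 59)},
        {"id": 3, "timestamp": datetime(2013, 3, 1, 17, 5)},
    ]
    for name, items in sorted(group_into_shards(batch).items()):
        print(name, [e["id"] for e in items])
```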
OLAP Use Case
•  Data: Windows Events
– 115M events
•  Test parameters
– Server: Quad Xeon CPUs, 32GB memory, 3 disks
– Cassandra memory: 1GB
– Load app/embedded Doradus memory: 4GB
– Load threads: 5
– Batch size: 5,000 events
– Shard size: 1 day (860 shards total)
•  Test results
– Total objects loaded: ~1 billion
– Total time: 32 minutes, 56 seconds
– Load rate: 502,991 objects/second
– Final database size: ~2GB
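As a quick consistency check on the test results, the reported load rate multiplied by the reported elapsed time should land near the "~1 billion" objects figure. The sketch below does that arithmetic using only the numbers from the slide; the variable names are illustrative.

```python
# Figures taken from the test results above.
minutes, seconds = 32, 56
load_rate = 502_991  # objects/second

elapsed_seconds = minutes * 60 + seconds
implied_objects = load_rate * elapsed_seconds

if __name__ == "__main__":
    print(f"Elapsed time: {elapsed_seconds} s")
    print(f"Implied objects loaded: {implied_objects:,}")
```

The product is about 994 million objects, which agrees with the rounded total on the slide.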
Doradus Spider Service
•  Analogous to Lucene + NoSQL
•  Fully inverted field indexing
– Configurable analyzers
– Stored-only (non-indexed) fields
•  Unique features:
– Automatic table-level sharding
– Statistics
– Pre-computed aggregate queries
– Refreshed in background
– Object-level data aging
•  Use case example:
– Indexing a massive number of documents
OLAP and Spider: When to Use
•  Spider is best for:
– Unstructured/variable-structure data
– Configurable indexing
– Fine-grained updates with immediate indexing
– Document storage and searching
– Emphasis on full-text/multi-field searching
•  OLAP is best for:
– High-volume data streams
– High-performance analytic queries
– Dense data storage
– Immutable/semi-mutable data
– Data that can be loaded in batches
– Data that can be partitioned (e.g., time-sharded)
Summary
•  What's cool about Doradus?
– Bi-directional links with referential integrity
– Link paths: simpler than joins
– Idempotent updates
– Partial object updates
– Simple transitive searching
– OLAP: dense storage and fast queries
– It's free!
Thank you!
Doradus is available at:
https://github.com/dell-oss/Doradus
Contact me:
randy.guck@dell.software.com

More Related Content

What's hot

Amazon DynamoDB Lessen's Learned by Beginner
Amazon DynamoDB Lessen's Learned by BeginnerAmazon DynamoDB Lessen's Learned by Beginner
Amazon DynamoDB Lessen's Learned by Beginner
Hirokazu Tokuno
 
Amazon DynamoDB Workshop
Amazon DynamoDB WorkshopAmazon DynamoDB Workshop
Amazon DynamoDB Workshop
Amazon Web Services
 
Vancouver AWS Meetup Slides 11-20-2018 Apache Spark with Amazon EMR
Vancouver AWS Meetup Slides 11-20-2018 Apache Spark with Amazon EMRVancouver AWS Meetup Slides 11-20-2018 Apache Spark with Amazon EMR
Vancouver AWS Meetup Slides 11-20-2018 Apache Spark with Amazon EMR
Allice Shandler
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Duyhai Doan
 
Cassandra introduction 2016
Cassandra introduction 2016Cassandra introduction 2016
Cassandra introduction 2016
Duyhai Doan
 
managing big data
managing big datamanaging big data
managing big data
Suveeksha
 
Spark Cassandra 2016
Spark Cassandra 2016Spark Cassandra 2016
Spark Cassandra 2016
Duyhai Doan
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into Cassandra
Jim Hatcher
 
Apache cassandra in 2016
Apache cassandra in 2016Apache cassandra in 2016
Apache cassandra in 2016
Duyhai Doan
 
Sasi, cassandra on full text search ride
Sasi, cassandra on full text search rideSasi, cassandra on full text search ride
Sasi, cassandra on full text search ride
Duyhai Doan
 
MongoDB 3.0
MongoDB 3.0 MongoDB 3.0
MongoDB 3.0
Victoria Malaya
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hive
Subhas Kumar Ghosh
 
Om nom nom nom
Om nom nom nomOm nom nom nom
Om nom nom nom
Anna Pawlicka
 
Munich March 2015 - Cassandra + Spark Overview
Munich March 2015 -  Cassandra + Spark OverviewMunich March 2015 -  Cassandra + Spark Overview
Munich March 2015 - Cassandra + Spark Overview
Christopher Batey
 
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Datastax day 2016 : Cassandra data modeling basics
Datastax day 2016 : Cassandra data modeling basicsDatastax day 2016 : Cassandra data modeling basics
Datastax day 2016 : Cassandra data modeling basics
Duyhai Doan
 
Big data 101 for beginners riga dev days
Big data 101 for beginners riga dev daysBig data 101 for beginners riga dev days
Big data 101 for beginners riga dev days
Duyhai Doan
 
A Deep Dive Into Spark
A Deep Dive Into SparkA Deep Dive Into Spark
A Deep Dive Into Spark
Ashish kumar
 
Intro To Cascading
Intro To CascadingIntro To Cascading
Intro To Cascading
Nate Murray
 

What's hot (20)

Amazon DynamoDB Lessen's Learned by Beginner
Amazon DynamoDB Lessen's Learned by BeginnerAmazon DynamoDB Lessen's Learned by Beginner
Amazon DynamoDB Lessen's Learned by Beginner
 
Amazon DynamoDB Workshop
Amazon DynamoDB WorkshopAmazon DynamoDB Workshop
Amazon DynamoDB Workshop
 
Vancouver AWS Meetup Slides 11-20-2018 Apache Spark with Amazon EMR
Vancouver AWS Meetup Slides 11-20-2018 Apache Spark with Amazon EMRVancouver AWS Meetup Slides 11-20-2018 Apache Spark with Amazon EMR
Vancouver AWS Meetup Slides 11-20-2018 Apache Spark with Amazon EMR
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
 
Cassandra introduction 2016
Cassandra introduction 2016Cassandra introduction 2016
Cassandra introduction 2016
 
managing big data
managing big datamanaging big data
managing big data
 
Spark Cassandra 2016
Spark Cassandra 2016Spark Cassandra 2016
Spark Cassandra 2016
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into Cassandra
 
Apache cassandra in 2016
Apache cassandra in 2016Apache cassandra in 2016
Apache cassandra in 2016
 
Sasi, cassandra on full text search ride
Sasi, cassandra on full text search rideSasi, cassandra on full text search ride
Sasi, cassandra on full text search ride
 
MongoDB 3.0
MongoDB 3.0 MongoDB 3.0
MongoDB 3.0
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hive
 
Om nom nom nom
Om nom nom nomOm nom nom nom
Om nom nom nom
 
Munich March 2015 - Cassandra + Spark Overview
Munich March 2015 -  Cassandra + Spark OverviewMunich March 2015 -  Cassandra + Spark Overview
Munich March 2015 - Cassandra + Spark Overview
 
Spark and shark
Spark and sharkSpark and shark
Spark and shark
 
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
 
Datastax day 2016 : Cassandra data modeling basics
Datastax day 2016 : Cassandra data modeling basicsDatastax day 2016 : Cassandra data modeling basics
Datastax day 2016 : Cassandra data modeling basics
 
Big data 101 for beginners riga dev days
Big data 101 for beginners riga dev daysBig data 101 for beginners riga dev days
Big data 101 for beginners riga dev days
 
A Deep Dive Into Spark
A Deep Dive Into SparkA Deep Dive Into Spark
A Deep Dive Into Spark
 
Intro To Cascading
Intro To CascadingIntro To Cascading
Intro To Cascading
 

Similar to Overiew of Cassandra and Doradus

Presentation
PresentationPresentation
Presentation
Dimitris Stripelis
 
Introduction to SQL Server Graph DB
Introduction to SQL Server Graph DBIntroduction to SQL Server Graph DB
Introduction to SQL Server Graph DB
Greg McMurray
 
GreenDao Introduction
GreenDao IntroductionGreenDao Introduction
GreenDao Introduction
Booch Lin
 
Manchester Hadoop Meetup: Spark Cassandra Integration
Manchester Hadoop Meetup: Spark Cassandra IntegrationManchester Hadoop Meetup: Spark Cassandra Integration
Manchester Hadoop Meetup: Spark Cassandra Integration
Christopher Batey
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra
Matthias Niehoff
 
Reading Cassandra Meetup Feb 2015: Apache Spark
Reading Cassandra Meetup Feb 2015: Apache SparkReading Cassandra Meetup Feb 2015: Apache Spark
Reading Cassandra Meetup Feb 2015: Apache Spark
Christopher Batey
 
C* Summit EU 2013: Denormalizing Your Data: A Java Library to Support Structu...
C* Summit EU 2013: Denormalizing Your Data: A Java Library to Support Structu...C* Summit EU 2013: Denormalizing Your Data: A Java Library to Support Structu...
C* Summit EU 2013: Denormalizing Your Data: A Java Library to Support Structu...
DataStax Academy
 
3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers
Christopher Batey
 
2012 mongo db_bangalore_roadmap_new
2012 mongo db_bangalore_roadmap_new2012 mongo db_bangalore_roadmap_new
2012 mongo db_bangalore_roadmap_newMongoDB
 
Postgres vs Mongo / Олег Бартунов (Postgres Professional)
Postgres vs Mongo / Олег Бартунов (Postgres Professional)Postgres vs Mongo / Олег Бартунов (Postgres Professional)
Postgres vs Mongo / Олег Бартунов (Postgres Professional)
Ontico
 
When to no sql and when to know sql javaone
When to no sql and when to know sql   javaoneWhen to no sql and when to know sql   javaone
When to no sql and when to know sql javaone
Simon Elliston Ball
 
Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...
Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...
Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...
Paul Leclercq
 
Cassandra Java APIs Old and New – A Comparison
Cassandra Java APIs Old and New – A ComparisonCassandra Java APIs Old and New – A Comparison
Cassandra Java APIs Old and New – A Comparison
shsedghi
 
Scala meetup - Intro to spark
Scala meetup - Intro to sparkScala meetup - Intro to spark
Scala meetup - Intro to spark
Javier Arrieta
 
Big Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and ClojureBig Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and Clojure
Dr. Christian Betz
 
Hands On Spring Data
Hands On Spring DataHands On Spring Data
Hands On Spring Data
Eric Bottard
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiA Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
Data Con LA
 
Databases for Data Science
Databases for Data ScienceDatabases for Data Science
Databases for Data Science
Alexander Hendorf
 
Mongodb intro
Mongodb introMongodb intro
Mongodb intro
christkv
 
Hybrid Databases - PHP UK Conference 22 February 2019
Hybrid Databases - PHP UK Conference 22 February 2019Hybrid Databases - PHP UK Conference 22 February 2019
Hybrid Databases - PHP UK Conference 22 February 2019
Dave Stokes
 

Similar to Overiew of Cassandra and Doradus (20)

Presentation
PresentationPresentation
Presentation
 
Introduction to SQL Server Graph DB
Introduction to SQL Server Graph DBIntroduction to SQL Server Graph DB
Introduction to SQL Server Graph DB
 
GreenDao Introduction
GreenDao IntroductionGreenDao Introduction
GreenDao Introduction
 
Manchester Hadoop Meetup: Spark Cassandra Integration
Manchester Hadoop Meetup: Spark Cassandra IntegrationManchester Hadoop Meetup: Spark Cassandra Integration
Manchester Hadoop Meetup: Spark Cassandra Integration
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra
 
Reading Cassandra Meetup Feb 2015: Apache Spark
Reading Cassandra Meetup Feb 2015: Apache SparkReading Cassandra Meetup Feb 2015: Apache Spark
Reading Cassandra Meetup Feb 2015: Apache Spark
 
C* Summit EU 2013: Denormalizing Your Data: A Java Library to Support Structu...
C* Summit EU 2013: Denormalizing Your Data: A Java Library to Support Structu...C* Summit EU 2013: Denormalizing Your Data: A Java Library to Support Structu...
C* Summit EU 2013: Denormalizing Your Data: A Java Library to Support Structu...
 
3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers
 
2012 mongo db_bangalore_roadmap_new
2012 mongo db_bangalore_roadmap_new2012 mongo db_bangalore_roadmap_new
2012 mongo db_bangalore_roadmap_new
 
Postgres vs Mongo / Олег Бартунов (Postgres Professional)
Postgres vs Mongo / Олег Бартунов (Postgres Professional)Postgres vs Mongo / Олег Бартунов (Postgres Professional)
Postgres vs Mongo / Олег Бартунов (Postgres Professional)
 
When to no sql and when to know sql javaone
When to no sql and when to know sql   javaoneWhen to no sql and when to know sql   javaone
When to no sql and when to know sql javaone
 
Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...
Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...
Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...
 
Cassandra Java APIs Old and New – A Comparison
Cassandra Java APIs Old and New – A ComparisonCassandra Java APIs Old and New – A Comparison
Cassandra Java APIs Old and New – A Comparison
 
Scala meetup - Intro to spark
Scala meetup - Intro to sparkScala meetup - Intro to spark
Scala meetup - Intro to spark
 
Big Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and ClojureBig Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and Clojure
 
Hands On Spring Data
Hands On Spring DataHands On Spring Data
Hands On Spring Data
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiA Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
 
Databases for Data Science
Databases for Data ScienceDatabases for Data Science
Databases for Data Science
 
Mongodb intro
Mongodb introMongodb intro
Mongodb intro
 
Hybrid Databases - PHP UK Conference 22 February 2019
Hybrid Databases - PHP UK Conference 22 February 2019Hybrid Databases - PHP UK Conference 22 February 2019
Hybrid Databases - PHP UK Conference 22 February 2019
 

Recently uploaded

TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
Tier1 app
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
Tendenci - The Open Source AMS (Association Management Software)
 
Explore Modern SharePoint Templates for 2024
Explore Modern SharePoint Templates for 2024Explore Modern SharePoint Templates for 2024
Explore Modern SharePoint Templates for 2024
Sharepoint Designs
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
Globus
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
Matt Welsh
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
Globus
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Anthony Dahanne
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
Globus
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
Ortus Solutions, Corp
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
takuyayamamoto1800
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
Why React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdfWhy React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdf
ayushiqss
 

Recently uploaded (20)

TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
 
Explore Modern SharePoint Templates for 2024
Explore Modern SharePoint Templates for 2024Explore Modern SharePoint Templates for 2024
Explore Modern SharePoint Templates for 2024
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf

Overview of Cassandra and Doradus

  • 1. Overview of Cassandra and The Doradus OSS Project Randy Guck Principal Engineer, Dell Software
  • 2. Overview •  What is NoSQL? – Common RDB roadblocks – NoSQL database types •  Overview of Cassandra – What's unique – Limitations •  Doradus – Architecture – Features – The OLAP and Spider storage managers – What each is good for – Where to get Doradus
  • 3. Why RDB Apps Look for Something Else •  Performance – B-trees – Locking – One writable copy of each record •  Scaling costs – RDBs scale "up" – Big boxes, SANs, fiber channel, etc. •  What if you want... – Distributed access – No single points of failure – Instant failover – Sharding – Replication
  • 4. NoSQL Data Models
      Data Model              Examples                               Elastic?  Queries?  Relationships?
      Key–Value               LevelDB, Kyoto Cabinet, Redis          No        No        No
      Distributed Key–Value   Dynamo, MemcacheDB, Riak, Voldemort    Yes       No        No
      Column-Oriented         Accumulo, Cassandra, HBase             Yes       Some      No
      Document-Oriented       Couchbase, Elasticsearch, MongoDB      Yes       Yes       Some
      Graph                   Neo4J, OrientDB, Titan                 No        Yes       Yes
      Column notes: Elastic? = Sharding + replication; Queries? = AND/OR/ranges/etc.; Relationships? = Built-in support
  • 5. NoSQL Data Models (the table from slide 4 repeated, with one added callout: Doradus goals)
  • 6. NoSQL Common Traits •  Distributed cluster of nodes – Commodity, shared-nothing servers – Scales horizontally – Expands elastically •  Replication – Performant local access – Automatic failover •  De-normalized data model •  Schemaless/dynamic columns •  Eventual consistency N=5, RF=3
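The "N=5, RF=3" label on slide 6 can be made concrete with a small sketch. It assumes a simple ring in which a key hashes to one node and the next RF-1 nodes clockwise hold the other replicas. The node names, the hash choice, and the `replicas` helper are illustrative assumptions, not taken from the slides, and real Cassandra placement (tokens, virtual nodes, rack awareness) is more involved.

```python
import hashlib

def replicas(key, nodes, rf):
    """Return the rf nodes that would hold `key` on a simple ring."""
    # Hash the key to pick the first node on the ring.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    start = int.from_bytes(digest, "big") % len(nodes)
    # The remaining replicas go to the next rf-1 nodes, wrapping around.
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]

NODES = ["node1", "node2", "node3", "node4", "node5"]  # N=5 (names assumed)
RF = 3                                                 # replication factor

if __name__ == "__main__":
    for key in ("62c36", "837a2", "2de83"):
        print(key, "->", replicas(key, NODES, RF))
```

With RF=3 every key lives on three distinct nodes, so losing any single node still leaves two copies available for failover.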
  • 7. Is NoSQL Catching On? Source: db-engines.com
  • 8. Overview of Cassandra •  Wide column NoSQL database •  Open sourced by Facebook •  Apache Project with active community •  Commercially supported by DataStax, Acunu, others •  Used by 1,500+ companies •  "Pure peer" architecture •  Largest known Cassandra cluster: 300+ TB data and 400+ machines.
  • 9. What is Cassandra best for? •  Continuous data streams – Logs, events, audit records, measurements, ... – Fast data ingestion – Predictable read performance •  Partitionable data – "1,000's of little databases in one" •  Elastic scalability – Expand/upgrade/repair without downtime •  Not good for: – Blob store – Persistent queue – OLTP transactions
  • 10. CQL Static Table
      CREATE TABLE songs (
          id     uuid PRIMARY KEY,
          title  text,
          album  text,
          artist text,
          data   blob
      );
      CREATE INDEX ON songs (artist);
      Row Key    Columns: "<column name>"="<column value>"
      62c36...   "album"="90125"  "artist"="Yes"  "data"=<audio>  "title"="Changes"
      837a2...   "album"="Crystal Ball"  "artist"="Styx"  "data"=<audio>  "title"="Put Me On"
      2de83...   "album"="Nevermind"  "artist"="Nirvana"  "data"=<audio>  "title"="Breed"
      ...
  • 11. CQL Clustered Table
      CREATE TABLE playlists (
          id         uuid,
          song_order int,
          song_id    uuid,    // copied from songs.id
          title      text,    // copied from songs.title
          album      text,    // copied from songs.album
          artist     text,    // copied from songs.artist
          PRIMARY KEY (id, song_order)    // compound key
      );
      Row Key    Columns: "<song_order>:<column name>"="<column value>"
      28d23...   "1:"=""  "1:album"="90125"  "1:artist"="Yes"  "1:song_id"="62c36..."  "1:title"="Changes"
                 "2:"=""  "2:album"="Nevermind"  "2:artist"="Nirvana"  "2:song_id"="2de83..."  "2:title"="Breed"
                 "3:"=""  ...
      2ed91...   "1:"=""  "1:album"="Crystal Ball"  "1:artist"="Styx"  "1:song_id"="837a2..."  "1:title"="Put Me On"
                 "2:"=""  ...
      ...
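The mapping on slide 11, from CQL rows to the columns stored under one row key, can be sketched as follows. The function name `to_storage_columns` is hypothetical; the column naming follows the slide's "<song_order>:<column name>" notation, including the empty "<song_order>:" marker column, and the sample data is the data shown on the slide.

```python
def to_storage_columns(cql_rows):
    """Group CQL rows by partition key and flatten each one into
    '<song_order>:<column name>' -> value pairs, as drawn on the slide."""
    storage = {}
    for row in cql_rows:
        columns = storage.setdefault(row["id"], {})
        prefix = f"{row['song_order']}:"
        columns[prefix] = ""  # empty marker column for this CQL row
        for name in ("album", "artist", "song_id", "title"):
            columns[prefix + name] = row[name]
    return storage

PLAYLIST_ROWS = [
    {"id": "28d23...", "song_order": 1, "song_id": "62c36...",
     "title": "Changes", "album": "90125", "artist": "Yes"},
    {"id": "28d23...", "song_order": 2, "song_id": "2de83...",
     "title": "Breed", "album": "Nevermind", "artist": "Nirvana"},
    {"id": "2ed91...", "song_order": 1, "song_id": "837a2...",
     "title": "Put Me On", "album": "Crystal Ball", "artist": "Styx"},
]

if __name__ == "__main__":
    for key, cols in to_storage_columns(PLAYLIST_ROWS).items():
        print(key)
        for name, value in cols.items():
            print(f'  "{name}"="{value}"')
```

Several CQL rows that share the same `id` end up as column groups inside a single stored row, which is why the compound key's first part acts as the row key and the second part as a column-name prefix.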
  • 12. CQL Clustered Table (cont.) (the playlists table definition and storage layout from slide 11 repeated, with one added callout: CQL "Rows")
  • 13. Can we make Cassandra more appealing? •  Data Model – No direct support for relationships •  Indexing – Secondary indexes: single column only – Hash table only: no range searching •  Searching – No joins, embedded queries – No aggregate queries – Limited equalities (e.g., SELECT * WHERE <key> IN (<list>)) – No full text search – No OR clauses – ...
  • 14. What is Doradus? •  Java service that enhances Cassandra •  Adds features: – REST API (JSON and XML) – Multi-tenancy – Graph model – Multi-field/full text query language – Automatic data aging – OLAP and Spider storage services •  Compatible with NoSQL tenets such as idempotent updates •  Under development for ~3 years •  Open source: Apache 2.0 License
  • 15. Doradus Graph Model •  A cluster hosts one or more applications •  An application owns tables which store objects •  An object consists of single- and multi-valued fields •  A pair of link fields forms a bi-directional relationship Message {Size, SendDate} Participant {ReceiptDate} Address {Name} Person {Name, Department} Attachment {Size, Extension} Manager→ ←Employees ↓Person Address↑ ↓Attachments Message↑ Recipients→ ←MessageAsRecipient Address→ ←Participants Sender→ ←MessageAsSender
  • 16. Example Object and Aggregate Queries
      •  Lucene full text query
         GET /Email/Person/_query?q=FirstName:j* AND NOT Office:[q TO z]
      •  Link path with filtering
         GET /Email/Message/_query?q=Sender.WHERE(ReceiptDate>'2010-06-01').Address.Name="*.com"
      •  Quantifiers
         GET /Email/Message/_aggregate?m=COUNT(*)&q=ANY(Recipients).ALL(Address).NONE(Person).Department:sales&f=Tags,TOP(3,TRUNCATE(SendDate,DAY))
      •  Transitive links
         GET /Email/Person/_query?q=DirectReports^(3).LastName=wilson&f=DirectReports(Name,DirectReports(Name))
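The queries on slide 16 contain spaces, brackets, and quotes, so a client would normally percent-encode them before sending. A minimal sketch using only the Python standard library is shown below. The base URL and the `doradus_url` helper are assumptions made for illustration; only the path shape (/application/table/command) and the q, m, and f parameters come from the slide, and it is assumed that the server decodes standard URL encoding.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:1123"  # assumed host and port, not from the slides

def doradus_url(application, table, command, **params):
    """Build a percent-encoded URL of the form /<application>/<table>/<command>?..."""
    path = "/".join(quote(part) for part in (application, table, command))
    # quote (rather than quote_plus) encodes spaces as %20 instead of '+'.
    query = urlencode(params, quote_via=quote)
    return f"{BASE_URL}/{path}?{query}"

if __name__ == "__main__":
    # Object query from the slide.
    print(doradus_url("Email", "Person", "_query",
                      q="FirstName:j* AND NOT Office:[q TO z]"))
    # Aggregate query from the slide: metric, query, and grouping fields.
    print(doradus_url("Email", "Message", "_aggregate",
                      m="COUNT(*)",
                      q="ANY(Recipients).ALL(Address).NONE(Person).Department:sales",
                      f="Tags,TOP(3,TRUNCATE(SendDate,DAY))"))
```

The resulting string can be passed to any HTTP client to issue the GET request.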
  • 18. Doradus: Multi-Data Center Clusters Cassandra Doradus Cassandra Cassandra Doradus Cassandra Doradus Cassandra Cassandra Doradus Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Rack 1, Data Center 1 Rack 1, Data Center 2 Applications Applications DC=2, N=6, RF=3
  • 19. Doradus: Internal Architecture App App App Monitor App Spider Storage Service OLAP Storage Service Cassandra Cluster JMX REST: Embedded Jetty Server Cassandra Interface doradus.yaml REST
  • 20. Doradus OLAP Service •  Borrows from online analytical processing – Sharding as data "cubes" – Columnar storage •  Very dense storage – No indexes! – Value arrays are compressed •  Fast load time – Up to 500,000 objects/second/node – Small "data lag" time •  Very fast queries – Searches millions of objects/second – Full DQL object and aggregate query support
  • 25. OLAP Use Case •  Data: Windows Events – 115M events •  Test parameters – Server: Quad Xeon CPUs, 32GB memory, 3 disks – Cassandra memory: 1GB – Load app/embedded Doradus memory: 4GB – Load threads: 5 – Batch size: 5,000 events – Shard size: 1 day (860 shards total) •  Test results – Total objects loaded: ~1 billion – Total time: 32 minutes, 56 seconds – Load rate: 502,991 objects/second – Final database size: ~2GB
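The figures on slide 25 can be cross-checked with a few lines of arithmetic. This sketch only uses numbers stated on the slide; the objects-per-event ratio it prints is plain division and says nothing about how the load application actually maps events to objects.

```python
def load_rate(objects, minutes, seconds):
    """Objects loaded per second over the given elapsed time."""
    return objects / (minutes * 60 + seconds)

ELAPSED_SECONDS = 32 * 60 + 56        # 32 min 56 s = 1,976 s
REPORTED_RATE = 502_991               # objects/second, from the slide
EVENTS = 115_000_000                  # "115M events", from the slide

# Object count implied by the reported rate and elapsed time.
implied_objects = REPORTED_RATE * ELAPSED_SECONDS   # 993,910,216

if __name__ == "__main__":
    print(f"elapsed: {ELAPSED_SECONDS} s")
    print(f"implied objects: {implied_objects:,}")
    print(f"objects per event: {implied_objects / EVENTS:.1f}")
    print(f"rate check: {load_rate(implied_objects, 32, 56):,.0f} objects/s")
```

The implied total of roughly 994 million objects is consistent with the slide's "~1 billion".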
  • 26. Doradus Spider Service •  Analogous to Lucene + NoSQL •  Fully inverted field indexing – Configurable analyzers – Stored-only (non-indexed) fields •  Unique features: – Automatic table-level sharding – Statistics – Pre-computed aggregate queries – Refreshed in background – Object-level data aging •  Use case example: – Indexing a massive number of documents
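As a rough illustration of what "fully inverted field indexing" with "stored-only (non-indexed) fields" means, here is a toy in-memory inverted index. It is not Doradus code: the analyzer, function names, and sample objects are all assumptions, and a real Spider index lives in Cassandra rather than in a Python dictionary.

```python
import re
from collections import defaultdict

def analyze(text):
    """A trivial analyzer: lowercase, then split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_index(objects, indexed_fields):
    """Map (field, term) -> set of object IDs for the indexed fields only."""
    index = defaultdict(set)
    for obj_id, fields in objects.items():
        for field in indexed_fields:
            for term in analyze(fields.get(field, "")):
                index[(field, term)].add(obj_id)
    return index

def search(index, field, term):
    """Return the sorted IDs of objects whose field contains the term."""
    return sorted(index.get((field, term.lower()), ()))

OBJECTS = {  # hypothetical sample objects
    "doc1": {"Subject": "Hello World", "Body": "Cassandra rocks"},
    "doc2": {"Subject": "hello again", "Body": "Spider indexes fields"},
}

if __name__ == "__main__":
    # Only Subject is indexed; Body is kept as a stored-only field.
    idx = build_index(OBJECTS, ["Subject"])
    print(search(idx, "Subject", "Hello"))
    print(search(idx, "Body", "cassandra"))
```

Swapping `analyze` for a different function is the toy equivalent of a configurable analyzer.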
  • 27. OLAP and Spider: When to Use •  Spider is best for: – Unstructured/variable- structure data – Configurable indexing – Fine-grained updates with immediate indexing – Document storage and searching – Emphasis on full-text/multi- field searching •  OLAP is best for: – High-volume data streams – High performance analytic queries – Dense data storage – Immutable/semi-mutable data – Data that can be loaded in batches – Data that can be partitioned (e.g., time-sharded)
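The OLAP-side criteria (data loaded in batches and partitioned by time) can be sketched as a small grouping step that a load application might run before sending data. The shard naming scheme, the event dictionary shape, and the function names are assumptions; the one-day shard size and the idea of fixed-size batches come from the use case on slide 25.

```python
from collections import defaultdict
from datetime import datetime

def shard_name(timestamp):
    """One shard per calendar day (the naming scheme is an assumption)."""
    return timestamp.strftime("%Y-%m-%d")

def batches_by_shard(events, batch_size):
    """Yield (shard, batch) pairs with at most batch_size events per batch."""
    by_shard = defaultdict(list)
    for event in events:
        by_shard[shard_name(event["time"])].append(event)
    for shard, items in sorted(by_shard.items()):
        for start in range(0, len(items), batch_size):
            yield shard, items[start:start + batch_size]

if __name__ == "__main__":
    events = [{"time": datetime(2014, 1, 1, hour), "id": hour} for hour in range(5)]
    events.append({"time": datetime(2014, 1, 2, 0), "id": 99})
    for shard, batch in batches_by_shard(events, batch_size=2):
        print(shard, [e["id"] for e in batch])
```

Because every batch targets exactly one shard, older shards stay untouched once their day has passed, which fits the "immutable/semi-mutable data" criterion.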
  • 28. Summary •  What's cool about Doradus? – Bi-directional links with referential integrity – Link paths: simpler than joins – Idempotent updates – Partial object updates – Simple transitive searching – OLAP: dense storage and fast queries – It's free!
  • 29. Thank you! Doradus is available at: https://github.com/dell-oss/Doradus Contact me: randy.guck@dell.software.com