8. Apache Cassandra™
• Massively scalable, open source, NoSQL distributed database built for modern,
mission-critical online applications
• Written in Java; a hybrid of Amazon Dynamo and Google BigTable
• Masterless with no single point of failure
• Distributed and data center aware
• 100% uptime
• Predictable scaling
• High Performance
• Multi Data Center
• Time Series
• Tunable Consistency
• Simple to Operate
• CQL language
• OpsCenter / DevCenter
BigTable: http://research.google.com/archive/bigtable-osdi06.pdf
Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
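The "Tunable Consistency" bullet above is commonly summarised by a quorum rule: a read is guaranteed to overlap the latest acknowledged write when the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (RF). A minimal Python sketch of that arithmetic follows; the helper names are illustrative and are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas in a quorum: a strict majority of the replicas."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)  # 2 replicas out of 3
print(is_strongly_consistent(q, q, rf))  # QUORUM reads with QUORUM writes
print(is_strongly_consistent(1, 1, rf))  # ONE reads with ONE writes
```

With RF = 3, QUORUM on both sides satisfies the rule, while ONE/ONE trades that guarantee for lower latency.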
14. Cassandra Data Access
CQL language via cqlsh (command line), DevCenter
(development environment) or drivers
• Drivers on Cassandra native protocol
• CQL COPY command (in cqlsh)
• Import/export tools for massive bulk loading
• Connectors in ETL solutions (Talend, Informatica)
• Via analytics layers Spark and Hadoop
• Via ODBC/JDBC drivers
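As an illustration of the driver route listed above, here is a minimal sketch using the DataStax Python driver (`cassandra-driver`). The contact point, the `demo` keyspace and the `readings` table with its columns are assumptions made for the example, not names from this deck.

```python
def connect(contact_points=("127.0.0.1",), keyspace="demo"):
    """Open a session over the Cassandra native protocol.

    Requires `pip install cassandra-driver` and a reachable cluster;
    the address and keyspace are placeholders.
    """
    from cassandra.cluster import Cluster  # imported lazily on purpose
    return Cluster(list(contact_points)).connect(keyspace)


def readings_for_sensor(session, sensor_id):
    """Read one partition by its key, the access pattern CQL is built for."""
    # Hypothetical table: readings(sensor_id, ts, value), partitioned by sensor_id.
    query = "SELECT ts, value FROM readings WHERE sensor_id = %s"
    return list(session.execute(query, (sensor_id,)))
```

Typical usage would be `rows = readings_for_sensor(connect(), "sensor-1")`. The query filters on the partition key only, which keeps it predictable.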
16. ODBC / JDBC Connections
ODBC drivers
• For SparkSQL (SQL engine on Spark), via JDBC/ODBC SparkSQL thrift server
• For Hive (Hadoop SQL engine)
• For Cassandra directly (ANSI SQL or CQL requests)
JDBC drivers
• For SparkSQL (SQL engine on Spark), via JDBC/ODBC SparkSQL thrift server
• For Cassandra directly (in progress)
• JDBC drivers from the community but not officially supported
18. Real-Time / Operational Analytics Use Cases
Recommendation Engine
Internet of Things
Fraud Detection
Risk Analysis
Buyer Behaviour Analytics
Telematics, Logistics
Business Intelligence
Infrastructure Monitoring
…
19. How to do analytics on Cassandra data?
Remember …
Cassandra = NO JOIN, NO GROUP BY, filter on primary key only
2 solutions:
• CQL with predictable queries
• Joins and Aggregations on the fly:
Server level => needs a distributed processing framework: Hadoop or Spark
Client level => possible but risky!
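To make the "client level" option concrete, the sketch below performs a join and a GROUP BY in application memory over rows that would have been fetched by two separate CQL queries. The tables, columns and values are hypothetical. It also shows why the approach is risky: every row has to travel to the client and fit in its memory.

```python
from collections import defaultdict

# Hypothetical rows, as a driver would return them from two separate CQL queries.
users = [
    {"user_id": 1, "country": "FR"},
    {"user_id": 2, "country": "US"},
]
orders = [
    {"order_id": 10, "user_id": 1, "amount": 20.0},
    {"order_id": 11, "user_id": 1, "amount": 5.0},
    {"order_id": 12, "user_id": 2, "amount": 7.5},
]


def revenue_by_country(users, orders):
    """Hash join on user_id followed by a GROUP BY country, all in client memory."""
    country_of = {u["user_id"]: u["country"] for u in users}  # build side of the join
    totals = defaultdict(float)
    for order in orders:  # probe side of the join
        country = country_of.get(order["user_id"])
        if country is not None:  # inner join: orders without a user are skipped
            totals[country] += order["amount"]
    return dict(totals)


print(revenue_by_country(users, orders))
```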
20. Reporting and Dashboard
Confidential
• Static and operational dashboards and reports created for a
specific Cassandra application.
• CQL, Solr queries and DataStax drivers
• KPIs and aggregations pre-calculated by scheduled batch jobs or on
the fly during insert.
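Pre-calculating a KPI "on the fly during insert" can be sketched as follows: each write updates both the detail data and a pre-aggregated total, so the dashboard reads a single pre-computed value instead of aggregating at query time. Plain Python dictionaries stand in for Cassandra tables here; the sensor data and names are invented for the example.

```python
from collections import defaultdict
from datetime import datetime

raw_events = []                    # stands in for the detail table
daily_totals = defaultdict(float)  # stands in for a table keyed by (sensor_id, day)


def insert_event(sensor_id, ts, value):
    """Write the raw event and update the pre-aggregated daily total together."""
    raw_events.append((sensor_id, ts, value))
    daily_totals[(sensor_id, ts.date().isoformat())] += value


def daily_total(sensor_id, day):
    """Dashboard read: a single key lookup, no aggregation at query time."""
    return daily_totals.get((sensor_id, day), 0.0)


insert_event("s-1", datetime(2015, 9, 23, 10, 0), 2.0)
insert_event("s-1", datetime(2015, 9, 23, 11, 30), 3.0)
insert_event("s-1", datetime(2015, 9, 24, 8, 15), 1.0)
print(daily_total("s-1", "2015-09-23"))
```

In a real deployment the two writes would target two tables, and the scheduled-batch alternative would recompute the totals periodically instead.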
21. BI & Data Visualization tools
For BI and Data Visualization tools like Tableau Software,
Power BI, QlikView, Excel, …
• DataStax ODBC driver
SQL joins and aggregations executed at client level!
• Spark ODBC driver (from Databricks or Microsoft)
SQL translated into Spark jobs and executed at server level
23. Power BI Desktop
Support for On-Prem Spark distributions
“The new data source in this month’s release is support for On-Prem Spark distributions. Last
month, we added support for Microsoft Azure HDInsight Spark, and this month we’re expanding
to other Spark distributions.
This new connector can be found under the “Other” category in the “Get Data” dialog.”
http://blogs.msdn.com/b/powerbi/archive/2015/09/23/44-new-features-in-the-power-bi-desktop-september-update.aspx
Microsoft Spark ODBC Driver
24. Notebook
Run code (Spark or CQL) from a Web browser
Notebooks like Zeppelin, Spark Notebook, Jupyter
For example Zeppelin:
• Examples available for Cassandra
• CQL language interpreter
• https://github.com/doanduyhai/incubator-zeppelin
26. Analytics with DataStax Enterprise
There are 4 ways to do Analytics on Cassandra data:
• Reporting with CQL queries
• Integrated Search (Solr)
• Integrated Batch Analytics (Hadoop integrated) on Cassandra
• Integrated Near Real-Time Analytics (Spark)
• Virtual multi data centers optimised as required – different workloads, hardware, availability, etc.
• Cassandra will replicate the data for you – no ETL is necessary
• Cassandra node started with Solr, Hadoop or Spark
Diagram: Cassandra replication between Transactions and Analytics data centers
27. Enterprise Search & Powerful Secondary Index
• Built-in enterprise search on Cassandra data via a strong Apache Solr and Lucene
integration
• Facets, Filtering, Geospatial search, Text Analysis, Joins, etc.
• Real-time indexing process and search operations
• Search queries from CQL and REST/Solr
• Solr shortcomings addressed by the integration:
• No bottleneck. Clients can read/write to any Solr node.
• Search index partitioning and replication for scalability and availability.
• Multi-DC support
• Data durability (Solr alone lacks a write-ahead log, so data can be lost)
Diagram: Cassandra replication between customer-facing nodes and search nodes
31. Spark Use Cases
• Load data from various sources
• Analytics (join, aggregate, transform, …)
• Sanitize, validate, normalize data
• Schema migration, data conversion
37. Ooyala Use Case: Hadoop + Cassandra
By leveraging data stored in Apache Cassandra, Ooyala is helping their customers take a more strategic
approach when delivering a digital video experience, so they can get ahead in this fast-evolving space.
http://www.datastax.com/resources/casestudies/ooyala
San Francisco-based video services company Ooyala provides a suite of technologies and services that support content
owners in managing, analyzing and monetizing the digital video they publish online, on mobile devices, and through the
over-the-top distribution platform for delivering Internet video to television.
38. Spotify Use Case: Hadoop + Cassandra
https://labs.spotify.com/2015/01/09/personalization-at-spotify-using-cassandra/
Personalization at Spotify using Cassandra
Cassandra is designed to handle big data workloads across multiple data centers with no single point of failure, providing enterprises with continuous availability without compromising performance.
It takes its distribution algorithm (partitioning and replication) from Dynamo and its log-structured data model from Bigtable.
Cassandra is a reinvented database that is lightning fast and always on, ideal for today's online applications where relational databases like Oracle can't keep up.
This means that in today's world, Cassandra stores and processes real-time information with fast, predictable performance and built-in fault tolerance.
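The Dynamo-style partitioning and replication mentioned in these notes can be illustrated with a small consistent-hashing ring. This is a simplified sketch under stated assumptions: one token per node (no virtual nodes), MD5 from the standard library rather than Cassandra's default Murmur3 partitioner, and replicas chosen by walking the ring clockwise. The node names are hypothetical.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a position on the ring (MD5 is used here for convenience)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class Ring:
    """Minimal token ring: each node owns the range that ends at its token."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = min(replication_factor, len(nodes))
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, partition_key):
        """Return the nodes holding a key: its owner, then the next nodes clockwise."""
        start = bisect.bisect_left(self._tokens, token(partition_key)) % len(self._ring)
        return [self._ring[(start + i) % len(self._ring)][1]
                for i in range(self.replication_factor)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
print(ring.replicas("user:42"))
```

Because placement depends only on the hash of the key, any node can compute where a row lives, which is what makes a masterless design possible.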
Predictive analytics
Does this simple architecture look familiar to you? Lambda
Nathan Marz