Analytics on Apache Cassandra,
an Operational Distributed Database
Victor Coustenoble, Solutions Engineer
victor.coustenoble@datastax.com
@vizanalytics
Paris Tech Talks Meetup
March 24th 2015
Apache Cassandra™
• Massively scalable, open source, NoSQL distributed database built for modern, mission-critical online applications
• Written in Java; a hybrid of Amazon Dynamo and Google BigTable
• Masterless with no single point of failure
• Distributed and data center aware
• 100% uptime
• Predictable scaling
• High Performance
• Multi Data Center
• Time Series
• Tunable Consistency
• Simple to Operate
• CQL language
• OpsCenter / DevCenter
BigTable: http://research.google.com/archive/bigtable-osdi06.pdf
Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
High Availability and Strong Consistency!
• A single node failure shouldn't cause a request to fail.
• Replication Factor + Consistency Level = Success
• This example:
• RF = 3
• CL = QUORUM (a majority of replicas: 2 of 3 here)
©2014 DataStax Confidential. Do not distribute without consent.
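[Diagram: a five-node ring. A write at CL=QUORUM is sent in parallel to the three replicas: Node 1 (1st copy), Node 2 (2nd copy) and Node 3 (3rd copy). Nodes 4 and 5 hold no copy. Acks shown: 5 μs, 12 μs, 12 μs.]
A majority of replicas acknowledged, so the request is a success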
CL(Read) + CL(Write) > RF => Strong Consistency
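
The rule above can be checked with a few lines of arithmetic. The following is a minimal sketch, not Cassandra code: the function names `quorum` and `is_strongly_consistent` are hypothetical, and consistency levels are represented simply as the number of replicas that must respond.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority (hypothetical helper)."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_cl: int, write_cl: int, replication_factor: int) -> bool:
    """True when every read set must overlap every write set: R + W > RF."""
    return read_cl + write_cl > replication_factor


rf = 3
q = quorum(rf)  # 2 replicas out of 3

# QUORUM reads + QUORUM writes: 2 + 2 > 3, so reads always see the latest write.
print(is_strongly_consistent(q, q, rf))

# ONE read + ONE write: 1 + 1 <= 3, so a read may hit a replica that missed the write.
print(is_strongly_consistent(1, 1, rf))
```

With RF = 3, a quorum is 2 replicas, so one node can be down and both reads and writes at QUORUM still succeed.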
Real-Time / Operational Big Data Use Cases
Recommendation Engine
Internet of Things
Fraud Detection
Risk Analysis
Buyer Behaviour Analytics
Telematics, Logistics
Business Intelligence
Infrastructure Monitoring
…
How to do analytics on Cassandra data?
Remember…
Cassandra = no JOIN, no GROUP BY, filter on primary key only
Cassandra needs a distributed processing framework
Data model independent queries
Cross-table operations (JOIN, UNION, etc.)
Complex analytics (e.g. machine learning)
Data transformation, aggregation, etc.
Stream processing
And much more…
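
To make the point concrete, here is a minimal sketch of the work that falls outside the database when there is no JOIN and no GROUP BY: the join and the aggregation have to be done by the application or by a processing framework. The table contents, column names and the function `revenue_by_country` are hypothetical, and the rows are plain in-memory dictionaries rather than results fetched from a cluster.

```python
from collections import defaultdict

# Hypothetical rows, as they might be read from two separate tables.
users = [
    {"user_id": 1, "country": "FR"},
    {"user_id": 2, "country": "FR"},
    {"user_id": 3, "country": "US"},
]
orders = [
    {"order_id": 10, "user_id": 1, "amount": 20.0},
    {"order_id": 11, "user_id": 1, "amount": 5.0},
    {"order_id": 12, "user_id": 3, "amount": 7.5},
]


def revenue_by_country(users, orders):
    """Join orders to users on user_id, then sum the amounts per country."""
    # Build side of a hash join: user_id -> country.
    country_of = {u["user_id"]: u["country"] for u in users}
    totals = defaultdict(float)
    for order in orders:
        country = country_of.get(order["user_id"])
        if country is not None:  # inner join: skip orders without a matching user
            totals[country] += order["amount"]  # the GROUP BY + SUM step
    return dict(totals)


print(revenue_by_country(users, orders))
```

On a single machine this is trivial. Over a large, partitioned data set the same join and aggregation need to run in parallel across nodes, which is the role of the frameworks listed next.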
Analytics on Cassandra
There are 4 ways to do Analytics on Cassandra data:
• Integrated Search (Solr)
• Integrated Batch Analytics (Hadoop integrated) on Cassandra
• External Batch Analytics (External Hadoop; certified with Cloudera, HortonWorks)
• Integrated Near Real-Time Analytics (Spark)
• Virtual data centers optimised as required: different workloads, hardware, availability, etc.
• Cassandra will replicate the data for you – no ETL is necessary
• Each Cassandra node is started with Solr, Hadoop or Spark
[Diagram: Cassandra replication between a Transactions data center and an Analytics data center]
• Built-in enterprise search on Cassandra data via Solr integration
• Facets, Filtering, Geospatial search, Text Analysis, etc.
• Near real-time search operations
• Search queries from CQL and REST/Solr
• Addresses the shortcomings of standalone Solr:
• No bottleneck. Clients can read from and write to any Solr node.
• Search index partitioning and replication for scalability and availability.
• Multi-DC support
• Data durability (standalone Solr lacks a write-ahead log, so data can be lost)
[Diagram: Cassandra replication between customer-facing nodes and Search nodes]
Batch Analytics - Hadoop
• Integrated Hadoop 1.0.4
• CFS (Cassandra File System), no HDFS
• No single point of failure
• No Hadoop complexity – every node is built the same
• Hive / Pig / Sqoop / Mahout
[Diagram: Cassandra replication between customer-facing nodes and Hadoop nodes]
External Batch Analytics - BYOH (Bring Your Own Hadoop)
• Hadoop 2.0.x support
• Cassandra node as a Data Node
• Example: Hive submits jobs to the Job tracker, which assigns tasks to Task trackers installed on the C* nodes
• Certified with Cloudera, HortonWorks
[Diagram labels: External Hadoop, Resource Manager, Hive request, Cassandra nodes]
Real-Time Analytics - Spark
• Tight integration with Cassandra
• Distributed Processing
• “In-memory Map/Reduce”, multi-threaded, well suited to iterative algorithms
• GraphX, MLlib, Spark SQL, Shark (Hive-like SQL)
• Spark Streaming - Real-time
• DataStax / Databricks partnership
• 10x – 100x faster than MapReduce
[Diagram: Cassandra replication between customer-facing nodes and Spark nodes; label: « Big Data » SDK]
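
As a rough illustration of the “in-memory Map/Reduce” idea, the sketch below runs the map, shuffle and reduce phases over a Python list without writing intermediate results to disk. It is plain Python on one machine, not the Spark API; the helper names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample events are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records, mapper):
    """Apply the mapper to every record and flatten the (key, value) pairs."""
    return [pair for record in records for pair in mapper(record)]


def shuffle_phase(pairs):
    """Group values by key, as a cluster would when routing keys to reducers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Fold each key's values into a single result."""
    return {key: reduce(reducer, values) for key, values in groups.items()}


# Hypothetical monitoring events: "<sensor> <status>".
events = ["sensor-a ok", "sensor-b error", "sensor-a error"]


def status_mapper(line):
    """Emit (status, 1) for each event line."""
    _sensor, status = line.split()
    return [(status, 1)]


pairs = map_phase(events, status_mapper)
counts = reduce_phase(shuffle_phase(pairs), lambda a, b: a + b)
print(counts)
```

Because every intermediate collection stays in memory, a second pass over `pairs` (for example with a different reducer) does not re-read the input, which is the property that makes this style attractive for iterative workloads.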
Real-time Big Data
Data Enrichment
Batch Processing
Machine Learning
Pre-computed aggregates
Data
NO ETL
Spark Use Cases
Load data from various sources
Analytics (join, aggregate, transform, …)
Sanitize, validate, normalize data
Schema migration, data conversion
Hot / Cold Data in a DataStax architecture
Hot Data
Online Operational Application
Cold Data
Offline Application
DataStax Cassandra Enterprise
DataStax Enterprise vs. Hadoop
NoSQL Matters Paris
Tracks from Duy Hai Doan – Cassandra technical advocate
@doanduyhai
• Day 1 (Thursday 26) 13:45 – 17:45
Training: Introduction to Apache Cassandra, CQL and Data Modelling
• Day 2 (Friday 27) 16:30 – 17:15
Conference: Real time analytics with Cassandra and Spark
Cassandra Days
Thanks
We power the big data apps
that transform business.

DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup

  • 1. Analytics on Apache Cassandra, an Operational Distributed Database Victor Coustenoble Solutions Engineer victor.coustenoble@datastax.com @vizanalytics Paris Tech Talks Meetup March 24th 2015
  • 2. Apache Cassandra™ • Massively scalable, Open Source, NoSQL, distributed database built for modern, mission- critical online applications • Written in Java and is a hybrid of Amazon Dynamo and Google BigTable • Masterless with no single point of failure • Distributed and data center aware • 100% uptime • Predictable scaling • High Performance • Multi Data Center • Time Series • Tunable Consistency • Simple to Operate • CQL language • OpsCenter / DevCenter Dynamo BigTable BigTable: http://research.google.com/archive/bigtable-osdi06.pdf Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
  • 3. High Availability and Strong Consistency ! • A single node failure shouldn’t bring failure. • Replication Factor + Consistency Level = Success • This example: • RF = 3 • CL = QUORUM (= 51% replicas) ©2014 DataStax Confidential. Do not distribute without consent. 3 Node 1 1st copy Node 4 Node 5 Node 2 2nd copy Node 3 3rd copy Parallel Write Write CL=QUORUM 5 μs ack 12 μs ack 12 μs ack >51% ack – so request is a success CL(Read) + CL(Write) > RF => Strong Consistency
  • 4. Real-Time / Operational Big Data Use Cases Recommendation Engine Internet of Things Fraud Detection Risk Analysis Buyer Behaviour Analytics Telematics, Logistics Business Intelligence Infrastructure Monitoring …
  • 5. How to do analytics on Cassandra data ? Remember … Cassandra = NO JOIN , NO GROUP BY , Filter on PK only
  • 6. Cassandra needs a distributed processing framework Data model independent queries Cross-table operations (JOIN, UNION, etc.) Complex analytics (e.g. machine learning) Data transformation, aggregation, etc. Stream processing Much more …..
  • 7. Analytics on Cassandra There are 4 ways to do Analytics on Cassandra data: • Integrated Search (Solr) • Integrated Batch Analytics (Hadoop integrated) on Cassandra • External Batch Analytics (External Hadoop; certified with Cloudera, HortonWorks) • Integrated Near Real-Time Analytics (Spark) • Virtual multi data centers optimised as required: different workloads, hardware, availability, etc. • Cassandra will replicate the data for you; no ETL is necessary • Cassandra node started with Solr, Hadoop or Spark • [Diagram: Cassandra replication between transactional and analytics nodes]
  • 8. Enterprise Search • Built-in enterprise search on Cassandra data via Solr integration • Facets, Filtering, Geospatial search, Text Analysis, etc. • Near real-time search operations • Search queries from CQL and REST/Solr • Solr shortcomings addressed: • No bottleneck. Clients can read/write to any Solr node. • Search index partitioning and replication for scalability and availability. • Multi-DC support • Data durability (Solr lacks a write-ahead log, so data can be lost) • [Diagram: Cassandra replication between customer-facing nodes and search nodes]
  • 9. Batch Analytics - Hadoop • Integrated Hadoop 1.0.4 • CFS (Cassandra File System), no HDFS • No single point of failure • No Hadoop complexity: every node is built the same • Hive / Pig / Sqoop / Mahout • [Diagram: Cassandra replication between customer-facing nodes and Hadoop nodes]
  • 10. External Batch Analytics - BYOH (Bring Your Own Hadoop) • Hadoop 2.0.x support • Cassandra node as a Data Node • Example: Hive submits jobs to the Job Tracker, which assigns tasks to Task Trackers installed on C* nodes • Certified with Cloudera, HortonWorks • [Diagram: a Hive request goes to the external Hadoop Resource Manager, which dispatches work to the Cassandra nodes]
  • 11. Real-Time Analytics - Spark • Tight integration with Cassandra • Distributed processing • "In-memory Map/Reduce", multi-threaded, best for iterative workloads • GraphX, MLlib, SparkSQL, Shark (Hive-like SQL) • Spark Streaming: real-time • DataStax / Databricks partnership • 10x to 100x the speed of MapReduce • "Big Data" SDK • [Diagram: Cassandra replication between customer-facing nodes and Spark nodes]
  • 12. Real-time Big Data • Data Enrichment • Batch Processing • Machine Learning • Pre-computed aggregates • Data: NO ETL
  • 13. Spark Use Cases • Load data from various sources • Analytics (join, aggregate, transform, …) • Sanitize, validate, normalize data • Schema migration, data conversion
  • 14. Hot / Cold Data in a DataStax architecture • Hot Data: Online Operational Application • Cold Data: Offline Application • DataStax Cassandra Enterprise © 2014 DataStax, All Rights Reserved. Company Confidential
  • 15. DataStax Enterprise vs. Hadoop
  • 16. NoSQL Matters Paris Tracks from Duy Hai Doan, Cassandra technical advocate, @doanduyhai • Day 1 (Thursday 26) 13:45 – 17:45 Training: Introduction to Apache Cassandra, CQL and Data Modelling • Day 2 (Friday 27) 16:30 – 17:15 Conference: Real time analytics with Cassandra and Spark
  • 18. Thanks We power the big data apps that transform business. ©2013 DataStax Confidential. Do not distribute without consent.
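
The consistency rule on slide 3, CL(Read) + CL(Write) > RF, comes down to simple arithmetic. The following is a minimal sketch of that arithmetic, not DataStax code: the function names `quorum` and `is_strongly_consistent` are hypothetical, and the quorum size is assumed to be the usual majority, floor(RF / 2) + 1.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must acknowledge at CL=QUORUM (assumed: a strict majority)."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def is_strongly_consistent(read_acks: int, write_acks: int,
                           replication_factor: int) -> bool:
    """Slide 3 rule: CL(Read) + CL(Write) > RF.

    When the rule holds, any set of replicas answering a read shares at
    least one replica with any set that acknowledged the latest write.
    """
    return read_acks + write_acks > replication_factor


RF = 3                 # replication factor used in the slide's example
Q = quorum(RF)         # replicas needed for QUORUM with RF = 3

# QUORUM reads combined with QUORUM writes satisfy the rule.
quorum_both = is_strongly_consistent(Q, Q, RF)
# Reading and writing at ONE (a single replica each) does not.
one_both = is_strongly_consistent(1, 1, RF)
```

With RF = 3 the quorum is 2, so the cluster can lose one replica and still serve QUORUM reads and writes, which is the trade-off the slide illustrates.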

Editor's Notes

  1. Cassandra is designed to handle big data workloads across multiple data centers with no single point of failure, providing enterprises with continuous availability without compromising performance. It takes its partitioning and replication (its distribution algorithm) from Dynamo and its log-structured data model from Bigtable. Cassandra is a reinvented database that is lightning fast and always on, ideal for today's online applications where relational databases like Oracle can't keep up. In today's world, this means Cassandra stores and processes real-time information with fast, predictable performance and built-in fault tolerance.
  2. Predictive analytics. Does this simple architecture look familiar to you? It is the Lambda architecture (Nathan Marz).
  3. DUYHAI
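
Note 1 mentions that Cassandra borrows its partitioning and replication from Dynamo. The following is a simplified sketch of that idea, a hash ring with replicas placed on consecutive nodes, and not Cassandra's actual implementation: the `Ring` class and `token` function are hypothetical names, MD5 stands in for the real partitioner's hash, and virtual nodes and data-center awareness are left out.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (MD5 is only a stand-in hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy hash ring: each node owns one token, replicas follow clockwise."""

    def __init__(self, nodes, replication_factor):
        if len(set(nodes)) != len(nodes):
            raise ValueError("node names must be unique")
        if not 1 <= replication_factor <= len(nodes):
            raise ValueError("replication factor must be between 1 and the node count")
        self.replication_factor = replication_factor
        # Sort nodes by their token so the list can be walked as a ring.
        self._ring = sorted((token(name), name) for name in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, partition_key: str):
        """Return the nodes holding copies of the given partition key."""
        # First node whose token is >= the key's token, wrapping to the start.
        start = bisect.bisect_left(self._tokens, token(partition_key)) % len(self._ring)
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(self.replication_factor)
        ]


ring = Ring(["node1", "node2", "node3", "node4", "node5"], replication_factor=3)
owners = ring.replicas("user:42")  # three distinct nodes out of five
```

Because placement depends only on the key's hash, any node can compute where a row lives, which is what allows a masterless design with no single point of failure.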