@doanduyhai
Real time data processing
with Spark & Cassandra
DuyHai DOAN, Technical Advocate

Who am I?
Duy Hai DOAN
Cassandra technical advocate
•  talks, meetups, confs
•  open-source devs (Achilles, …)
•  OSS Cassandra point of contact
☞ duy_hai.doan@datastax.com
☞ @doanduyhai

DataStax
•  Founded in April 2010 
•  We contribute a lot to Apache Cassandra™
•  400+ customers (25 of the Fortune 100), 200+ employees
•  Headquartered in the San Francisco Bay Area
•  EU headquarters in London, offices in France and Germany
•  DataStax Enterprise = OSS Cassandra + extra features

Spark & Cassandra Integration
•  Spark & its eco-system
•  Cassandra & token ranges
•  Stand-alone cluster deployment

What is Apache Spark?
Created at UC Berkeley AMPLab

Open-sourced in 2010, Apache project since 2013

General data processing framework

MapReduce is not the A & Ω

One-framework-many-components approach

Spark characteristics
Fast
•  Up to 10x-100x faster than Hadoop MapReduce, depending on the workload
•  In-memory storage
•  Single JVM process per node, multi-threaded

Easy
•  Rich Scala, Java and Python APIs (R is coming …)
•  2x-5x less code
•  Interactive shell

Spark code example
Setup
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf(true)
  .setAppName("basic_example")
  .setMaster("local[3]")

val sc = new SparkContext(conf)

Data-set (can be from text, CSV, JSON, Cassandra, HDFS, …)
val people = List(("jdoe", "John DOE", 33),
                  ("hsue", "Helen SUE", 24),
                  ("rsmith", "Richard Smith", 33))

RDDs
RDD = Resilient Distributed Dataset

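import org.apache.spark.rdd.RDD

val parallelPeople: RDD[(String, String, Int)] = sc.parallelize(people)

val extractAge: RDD[(Int, (String, String, Int))] = parallelPeople
  .map(tuple => (tuple._3, tuple))

val groupByAge: RDD[(Int, Iterable[(String, String, Int)])] = extractAge.groupByKey()

// Count people per age. Calling countByKey() on groupByAge would return 1 for every age,
// because each key appears only once after grouping.
val countByAge: scala.collection.Map[Int, Long] = extractAge.countByKey()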
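
To see what the pipeline above computes without starting Spark, here is a minimal sketch that replays the same steps on plain Scala collections. The collections only stand in for RDDs (an assumption made for illustration); nothing here is distributed.

```scala
// Plain Scala collections standing in for RDDs (no Spark needed)
val people = List(
  ("jdoe", "John DOE", 33),
  ("hsue", "Helen SUE", 24),
  ("rsmith", "Richard Smith", 33))

// Same shape as extractAge: key each tuple by its age
val extractAge: List[(Int, (String, String, Int))] =
  people.map(tuple => (tuple._3, tuple))

// Same shape as groupByAge: age -> all people of that age
val groupByAge: Map[Int, List[(String, String, Int)]] =
  extractAge.groupBy(_._1).map { case (age, pairs) => (age, pairs.map(_._2)) }

// Same result as countByKey() applied to extractAge: age -> number of people
val countByAge: Map[Int, Long] =
  groupByAge.map { case (age, persons) => (age, persons.size.toLong) }
```

With the three sample people, age 33 maps to two persons and age 24 to one.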

RDDs
RDD[A] = distributed collection of A 
•  RDD[Person]
•  RDD[(String,Int)], …

RDD[A] split into partitions

Partitions distributed over n workers → parallel computing
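
The idea of a collection split into partitions can be sketched in a few lines of plain Scala. `Partitioned` and `parallelize` below are hypothetical stand-ins written for this example, not Spark classes, and the partitions are processed one after another rather than on separate workers.

```scala
// Hypothetical stand-in for an RDD[A]: a collection split into partitions
final case class Partitioned[A](partitions: Vector[Vector[A]]) {
  // Apply f partition by partition; in Spark each partition could run on a different worker
  def map[B](f: A => B): Partitioned[B] = Partitioned(partitions.map(_.map(f)))

  // Reduce inside each partition first, then merge the partial results
  def reduce(f: (A, A) => A): A =
    partitions.filter(_.nonEmpty).map(_.reduce(f)).reduce(f)
}

// Split a local collection into at most numPartitions chunks of equal size
def parallelize[A](data: Seq[A], numPartitions: Int): Partitioned[A] = {
  val size = math.max(1, math.ceil(data.size.toDouble / numPartitions).toInt)
  Partitioned(data.toVector.grouped(size).toVector)
}

val numbers = parallelize(1 to 100, numPartitions = 4)
val sumOfSquares = numbers.map(n => n * n).reduce(_ + _)
```

Because the per-partition work is independent, only the final merge of partial results needs data from more than one partition.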

Spark eco-system
•  Components: Spark Streaming, MLLib, GraphX, Spark SQL
•  Spark Core Engine (Scala/Java/Python)
•  Cluster Manager: Local, Standalone cluster, YARN, Mesos
•  Persistence: …

What is Apache Cassandra?
Created at Facebook

Apache Project since 2009

Distributed NoSQL database

Eventual consistency (A & P of the CAP theorem)

Distributed table abstraction

Cassandra data distribution reminder
Random: hash of #partition → token = hash(#p)

Hash: ]-X, X]

X = huge number (2^64/2)

[Diagram: ring of 8 nodes, n1 to n8]

Cassandra token ranges
A: ]0, X/8]
B: ] X/8, 2X/8]
C: ] 2X/8, 3X/8]
D: ] 3X/8, 4X/8]
E: ] 4X/8, 5X/8]
F: ] 5X/8, 6X/8]
G: ] 6X/8, 7X/8]
H: ] 7X/8, X]

Murmur3 hash function
[Diagram: ring of 8 nodes n1 to n8, each owning one token range A to H]
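
A minimal sketch of how a partition key ends up in one of the eight ranges listed above. Two things are assumptions made for illustration: X is shrunk to 2^32, and Scala's 32-bit `MurmurHash3` replaces the 64-bit Murmur3 variant that Cassandra actually uses, so the tokens computed here do not match real Cassandra tokens.

```scala
import scala.util.hashing.MurmurHash3

// Simplified model of the slide: tokens in ]0, X] split into 8 equal ranges A..H
val X: Long = 1L << 32 // illustrative bound only
val rangeNames: Vector[String] = Vector("A", "B", "C", "D", "E", "F", "G", "H")

// Range i covers ] i*X/8, (i+1)*X/8 ] (open on the left, closed on the right)
def rangeOf(token: Long): String = {
  require(token > 0 && token <= X, s"token $token outside ]0, X]")
  val width = X / 8
  rangeNames(((token - 1) / width).toInt)
}

// Stand-in hash: Scala's 32-bit MurmurHash3 shifted into ]0, X]
def tokenOf(partitionKey: String): Long =
  (MurmurHash3.stringHash(partitionKey).toLong & 0xFFFFFFFFL) + 1
```

`rangeOf(tokenOf("user_id1"))` then names the range, and therefore the node, responsible for that key.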

Linear scalability
[Diagram: partition keys user_id1 to user_id5 spread over token ranges A to H, owned by nodes n1 to n8]

Cassandra Query Language (CQL)

INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);

UPDATE users SET age = 34 WHERE login = 'jdoe';

DELETE age FROM users WHERE login = 'jdoe';

SELECT age FROM users WHERE login = 'jdoe';

Why Spark on Cassandra?
For Spark
•  Reliable persistent store (HA)
•  Structured data (Cassandra CQL → DataFrame API)
•  Multi data-center!
For Cassandra
•  Cross-table operations (JOIN, UNION, etc.)
•  Real-time/batch processing
•  Complex analytics (e.g. machine learning)

Use Cases
•  Load data from various sources
•  Analytics (join, aggregate, transform, …)
•  Sanitize, validate, normalize data
•  Schema migration, data conversion

Cluster deployment
Stand-alone cluster
[Diagram: 5 nodes, each running Cassandra (C*) and a Spark worker (SparkW); one node also runs the Spark master (SparkM)]

Cluster deployment
Cassandra – Spark placement: 1 Cassandra process ⟷ 1 Spark worker
[Diagram: Driver Program, Spark Master, and 4 Spark Workers, each with one Executor and one co-located Cassandra (C*) node]

Spark & Cassandra Connector
•  Core API
•  SparkSQL
•  SparkStreaming

Connector architecture
All Cassandra types supported and converted to Scala types

Server side data filtering (SELECT … WHERE …)

Uses the Java driver underneath

Scala and Java support

Connector architecture – Core API
Cassandra tables exposed as Spark RDDs

Read from and write to Cassandra

Mapping of C* tables and rows to Scala objects
•  CassandraRow
•  Scala case class (object mapper)
•  Scala tuples 
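
The three mapping styles listed above can be illustrated without the connector. In this sketch `Row` is a hypothetical stand-in for a generic row type such as CassandraRow, and the `word`/`count` columns are assumed for the example; it only shows the shape of each mapping, not how the connector implements it.

```scala
// Hypothetical stand-in for a generic row: column name -> value
final case class Row(columns: Map[String, Any]) {
  def getString(name: String): String = columns(name).asInstanceOf[String]
  def getInt(name: String): Int = columns(name).asInstanceOf[Int]
}

// Typed target with one field per column
final case class WordCount(word: String, count: Int)

val row = Row(Map("word" -> "bar", "count" -> 30))

// 1. Generic access by column name
val count: Int = row.getInt("count")

// 2. Mapping to a case class
val asCaseClass: WordCount = WordCount(row.getString("word"), row.getInt("count"))

// 3. Mapping to a tuple
val asTuple: (String, Int) = (row.getString("word"), row.getInt("count"))
```

The generic row is convenient for exploration, while case classes and tuples give compile-time types for the rest of the job.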


Connector architecture – Spark SQL

Mapping of Cassandra table to SchemaRDD
•  CassandraSQLRow → SparkRow
•  custom query plan
•  push predicates to CQL for early filtering

SELECT * FROM user_emails WHERE login = 'jdoe';
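
Predicate push-down can be pictured as splitting the WHERE clause in two: the part Cassandra can evaluate, and the rest that Spark filters afterwards. The sketch below is a deliberately simplified model, assuming only equality predicates on partition key columns are pushed down; `Eq`, `splitPredicates`, and the `email` column are hypothetical names, and the real query planner handles more cases.

```scala
// Hypothetical, simplified predicate model: column = value
final case class Eq(column: String, value: String)

// Assumption: only equality on partition key columns is pushed down to CQL
def splitPredicates(predicates: Seq[Eq], partitionKey: Set[String]): (Seq[Eq], Seq[Eq]) =
  predicates.partition(p => partitionKey.contains(p.column))

// Render the pushed-down predicates as a CQL WHERE clause body
def toCqlWhere(pushed: Seq[Eq]): String =
  pushed.map(p => s"${p.column} = '${p.value}'").mkString(" AND ")

val (pushed, residual) =
  splitPredicates(Seq(Eq("login", "jdoe"), Eq("email", "jdoe@example.com")), Set("login"))

// Query sent to Cassandra; the residual predicates would be applied in Spark
val cql = s"SELECT * FROM user_emails WHERE ${toCqlWhere(pushed)}"
```

Filtering early in Cassandra means fewer rows cross the network before Spark sees them.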

Connector architecture – Spark Streaming

Streaming data INTO Cassandra table
•  trivial setup
•  be careful about your Cassandra data model !!!
Streaming data OUT of Cassandra tables ?
•  work in progress …

Connector API
•  Connector API
•  Data Locality Implementation

Connector API
Connecting to Cassandra
// Import Cassandra-specific functions on SparkContext and RDD objects
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("spark.cassandra.connection.host", "192.168.123.10") // initial contact
  .set("spark.cassandra.auth.username", "cassandra")
  .set("spark.cassandra.auth.password", "cassandra")

val sc = new SparkContext(conf)

Connector API
Preparing test data
CREATE TABLE test.words (word text PRIMARY KEY, count int);

INSERT INTO test.words (word, count) VALUES ('bar', 30);
INSERT INTO test.words (word, count) VALUES ('foo', 20);

Connector API
Reading from Cassandra
// Use table as RDD
val rdd = sc.cassandraTable("test", "words")
// rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

rdd.collect().foreach(println)
// CassandraRow[word: bar, count: 30]
// CassandraRow[word: foo, count: 20]

rdd.columnNames    // Stream(word, count)
rdd.count()        // 2

val firstRow = rdd.first  // firstRow: CassandraRow = CassandraRow[word: bar, count: 30]

firstRow.getInt("count")  // Int = 30

Connector API
Writing data to Cassandra
val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
// newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

newRdd.saveToCassandra("test", "words", SomeColumns("word", "count"))

SELECT * FROM test.words;

 word | count
------+-------
  bar |    30
  foo |    20
  cat |    40
  fox |    50

Remember token ranges?
A: ]0, X/8]
B: ] X/8, 2X/8]
C: ] 2X/8, 3X/8]
D: ] 3X/8, 4X/8]
E: ] 4X/8, 5X/8]
F: ] 5X/8, 6X/8]
G: ] 6X/8, 7X/8]
H: ] 7X/8, X]
[Diagram: nodes n1 to n8, each owning one token range A to H]

Data Locality
[Diagram: each node runs Cassandra (C*) and a Spark worker (SparkW); each Spark partition of the RDD maps to Cassandra token ranges]

Use Murmur3Partitioner
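
One way to picture the mapping between Spark partitions and token ranges is to group the ranges by the node that owns them and attach that node as the preferred location of the partition. `TokenRange`, `LocalPartition`, and `partitionsFor` below are hypothetical names for this sketch; the connector's real partitioning logic is more involved (replication, split sizes).

```scala
// Hypothetical model: a token range and the node that owns it
final case class TokenRange(name: String, owner: String)

// A Spark-like partition: a group of token ranges plus the node where it should preferably run
final case class LocalPartition(index: Int, preferredNode: String, ranges: Vector[TokenRange])

// The ring from the slides: range A on n1, B on n2, ..., H on n8
val ring: Vector[TokenRange] =
  Vector("A", "B", "C", "D", "E", "F", "G", "H").zipWithIndex
    .map { case (name, i) => TokenRange(name, s"n${i + 1}") }

// One partition per owning node, so each task can read from its local Cassandra process
def partitionsFor(ranges: Vector[TokenRange]): Vector[LocalPartition] =
  ranges.groupBy(_.owner).toVector.sortBy(_._1).zipWithIndex
    .map { case ((node, owned), idx) => LocalPartition(idx, node, owned) }

val partitions = partitionsFor(ring)
```

A scheduler that honours `preferredNode` runs each task next to the data it reads, which is the point of co-locating Spark workers with Cassandra processes.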

Read data locality
[Diagram: stages labelled "Read from Cassandra" and "Spark shuffle operations"]

Repartition before write
Write to Cassandra
rdd.repartitionByCassandraReplica("keyspace","table")

Or async batch writes
Async batches fan-out writes to Cassandra
Spark shuffle operations

Write data locality
•  either stream data with Spark using repartitionByCassandraReplica()
•  or flush data to Cassandra by async batches
•  in any case, there will be data movement on network (sorry no magic)
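
The two write strategies above can be contrasted on a handful of rows. This is a sketch under stated assumptions: `replicaFor` is a hypothetical ownership function based on `hashCode` (not Cassandra's partitioner), the node list is made up, and no actual write is performed.

```scala
// Hypothetical ownership function: which node owns a given partition key
def replicaFor(key: String, nodes: Vector[String]): String =
  nodes(Math.floorMod(key.hashCode, nodes.size))

val nodes = Vector("n1", "n2", "n3", "n4")
val rows: Vector[(String, Int)] =
  Vector("bar" -> 30, "foo" -> 20, "cat" -> 40, "fox" -> 50)

// Option 1: regroup rows by owning node before writing
// (the idea behind repartitionByCassandraReplica)
val byReplica: Map[String, Vector[(String, Int)]] =
  rows.groupBy { case (word, _) => replicaFor(word, nodes) }

// Option 2: leave rows where they are and send them in fixed-size batches
val batches: Vector[Vector[(String, Int)]] = rows.grouped(2).toVector
```

In both cases rows travel over the network once: either during the regrouping step or when a batch is sent to a node that does not run on the same machine.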

Joins with data locality

CREATE TABLE artists(name text, style text, … PRIMARY KEY(name));


CREATE TABLE albums(title text, artist text, year int,… PRIMARY KEY(title));
val join: CassandraJoinRDD[(String,Int), (String,String)] =
sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
// Select only useful columns for join and processing
.select("artist","year")
.as((_:String, _:Int))
// Repartition RDDs by "artists" PK, which is "name"
.repartitionByCassandraReplica(KEYSPACE, ARTISTS)
// Join with "artists" table, selecting only "name" and "country" columns
.joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
.on(SomeColumns("name"))
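
Conceptually, the join above looks up one artist per album row by primary key instead of scanning the whole artists table. The sketch below mimics that with an in-memory map; the sample rows are made up, and the inner-join behaviour (unmatched rows dropped) is an assumption of this illustration.

```scala
final case class Artist(name: String, country: String)

// Stand-in for the artists table, keyed by its primary key "name" (sample rows are made up)
val artistsByName: Map[String, Artist] = Map(
  "artist_a" -> Artist("artist_a", "FR"),
  "artist_b" -> Artist("artist_b", "UK"))

// Left side: (artist, year) pairs selected from albums
val albums: Vector[(String, Int)] =
  Vector(("artist_a", 2001), ("artist_b", 1997), ("artist_c", 2010))

// For each left-hand row, look up the matching artist by key; rows without a match are dropped
val joined: Vector[((String, Int), (String, String))] =
  albums.flatMap { case row @ (artist, _) =>
    artistsByName.get(artist).map(a => (row, (a.name, a.country)))
  }
```

Repartitioning the left side by replica first means each lookup can be served by the Cassandra process on the same node.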

Joins pipeline with data locality
val join: CassandraJoinRDD[(String,Int), (String,String)] =
sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
// Select only useful columns for join and processing
.select("artist","year")
.as((_:String, _:Int))
.repartitionByCassandraReplica(KEYSPACE, ARTISTS)
.joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
.on(SomeColumns("name"))
.map(…)
.filter(…)
.groupByKey()
.mapValues(…)
.repartitionByCassandraReplica(KEYSPACE, ARTISTS_RATINGS)
.joinWithCassandraTable(KEYSPACE, ARTISTS_RATINGS)
…

Perfect data locality scenario
•  read locally from Cassandra
•  use operations that do not require shuffle in Spark (map, filter, …)
•  repartitionByCassandraReplica() → to a table having the same partition key as the original table
•  save back into this Cassandra table
Demo
https://github.com/doanduyhai/Cassandra-Spark-Demo

What's next?
Datastax Enterprise 4.7 
•  Cassandra + Spark + Solr as your analytics platform

Filter out as much data as possible from Cassandra with Solr

Fetch the filtered data in Spark and perform aggregations

Save back final data into Cassandra

What's next?
What about data locality?

What's next?
val join: CassandraJoinRDD[(String,Int), (String,String)] =
  sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
    // Select only useful columns for join and processing
    .select("artist","year").where("solr_query = 'style:*rock* AND ratings:[3 TO *]' ")
    .as((_:String, _:Int))
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
    .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
    .on(SomeColumns("name")).where("solr_query = 'age:[20 TO 30]' ")

1.  compute Spark partitions using Cassandra token ranges
2.  on each partition, use Solr for local data filtering (no fan out !)
3.  fetch data back into Spark for aggregations
4.  repeat 1 – 3 as many times as necessary 

What's next?

SELECT … FROM …
WHERE token(#partition) > 3X/8
AND token(#partition) <= 4X/8
AND solr_query = 'full text search expression';

Advantages of same JVM Cassandra + Solr integration
1.  Single-pass local full text search (no fan out)
2.  Data retrieval

Token range D: ] 3X/8, 4X/8]
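
Generating one such query per token range could look like the sketch below. It keeps the slide's symbolic bounds (multiples of X/8) instead of real 64-bit tokens, and the `albums` table and `title` partition key are placeholder names assumed for the example; `EighthRange`, `bound`, and `queryFor` are hypothetical helpers.

```scala
// Bounds are expressed as multiples of X/8, as on the slide
final case class EighthRange(name: String, startEighths: Int, endEighths: Int)

val eighthRanges: Vector[EighthRange] =
  Vector("A", "B", "C", "D", "E", "F", "G", "H").zipWithIndex
    .map { case (name, i) => EighthRange(name, i, i + 1) }

// Render a bound the way the slides write it: 0, X/8, 2X/8, ..., X
def bound(eighths: Int): String = eighths match {
  case 0 => "0"
  case 1 => "X/8"
  case 8 => "X"
  case n => s"${n}X/8"
}

// Table and column names are hypothetical placeholders
def queryFor(r: EighthRange, solrQuery: String): String =
  s"SELECT * FROM albums WHERE token(title) > ${bound(r.startEighths)} " +
    s"AND token(title) <= ${bound(r.endEighths)} AND solr_query = '$solrQuery'"

val queryForD: String = queryFor(eighthRanges(3), "style:*rock*")
```

Each generated query targets a single token range, so it can be answered by the node owning that range without fanning out to the rest of the cluster.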

Q & A
Thank You
@doanduyhai
duy_hai.doan@datastax.com
https://academy.datastax.com/

More Related Content

What's hot

Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureRussell Spitzer
 
Spark cassandra integration, theory and practice
Spark cassandra integration, theory and practiceSpark cassandra integration, theory and practice
Spark cassandra integration, theory and practiceDuyhai Doan
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraRussell Spitzer
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & SparkMatthias Niehoff
 
Spark cassandra integration 2016
Spark cassandra integration 2016Spark cassandra integration 2016
Spark cassandra integration 2016Duyhai Doan
 
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayAnalytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayMatthias Niehoff
 
Fast track to getting started with DSE Max @ ING
Fast track to getting started with DSE Max @ INGFast track to getting started with DSE Max @ ING
Fast track to getting started with DSE Max @ INGDuyhai Doan
 
Datastax enterprise presentation
Datastax enterprise presentationDatastax enterprise presentation
Datastax enterprise presentationDuyhai Doan
 
Lightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkVictor Coustenoble
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraRustam Aliyev
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraVictor Coustenoble
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...Duyhai Doan
 
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at OoyalaCassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at OoyalaDataStax Academy
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterDon Drake
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.Natalino Busa
 
Cassandra introduction 2016
Cassandra introduction 2016Cassandra introduction 2016
Cassandra introduction 2016Duyhai Doan
 
Sasi, cassandra on full text search ride
Sasi, cassandra on full text search rideSasi, cassandra on full text search ride
Sasi, cassandra on full text search rideDuyhai Doan
 
Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data prajods
 
Introduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and HadoopIntroduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and HadoopPatricia Gorla
 
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...StampedeCon
 

What's hot (20)

Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
 
Spark cassandra integration, theory and practice
Spark cassandra integration, theory and practiceSpark cassandra integration, theory and practice
Spark cassandra integration, theory and practice
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and Cassandra
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
 
Spark cassandra integration 2016
Spark cassandra integration 2016Spark cassandra integration 2016
Spark cassandra integration 2016
 
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayAnalytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
 
Fast track to getting started with DSE Max @ ING
Fast track to getting started with DSE Max @ INGFast track to getting started with DSE Max @ ING
Fast track to getting started with DSE Max @ ING
 
Datastax enterprise presentation
Datastax enterprise presentationDatastax enterprise presentation
Datastax enterprise presentation
 
Lightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and Spark
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
 
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
Cassandra and Spark, closing the gap between no sql and analytics   codemotio...Cassandra and Spark, closing the gap between no sql and analytics   codemotio...
Cassandra and Spark, closing the gap between no sql and analytics codemotio...
 
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at OoyalaCassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
 
Cassandra introduction 2016
Cassandra introduction 2016Cassandra introduction 2016
Cassandra introduction 2016
 
Sasi, cassandra on full text search ride
Sasi, cassandra on full text search rideSasi, cassandra on full text search ride
Sasi, cassandra on full text search ride
 
Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data
 
Introduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and HadoopIntroduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and Hadoop
 
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
 

Viewers also liked

Cassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patternsCassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patternsDave Gardner
 
Spark Cassandra 2016
Spark Cassandra 2016Spark Cassandra 2016
Spark Cassandra 2016Duyhai Doan
 
Introduction to KillrChat
Introduction to KillrChatIntroduction to KillrChat
Introduction to KillrChatDuyhai Doan
 
KillrChat Data Modeling
KillrChat Data ModelingKillrChat Data Modeling
KillrChat Data ModelingDuyhai Doan
 
KillrChat presentation
KillrChat presentationKillrChat presentation
KillrChat presentationDuyhai Doan
 
Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016
Sasi, cassandra on the full text search ride At  Voxxed Day Belgrade 2016Sasi, cassandra on the full text search ride At  Voxxed Day Belgrade 2016
Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016Duyhai Doan
 
Cassandra introduction @ ParisJUG
Cassandra introduction @ ParisJUGCassandra introduction @ ParisJUG
Cassandra introduction @ ParisJUGDuyhai Doan
 
Cassandra drivers and libraries
Cassandra drivers and librariesCassandra drivers and libraries
Cassandra drivers and librariesDuyhai Doan
 
Cassandra introduction mars jug
Cassandra introduction mars jugCassandra introduction mars jug
Cassandra introduction mars jugDuyhai Doan
 
Cassandra introduction @ NantesJUG
Cassandra introduction @ NantesJUGCassandra introduction @ NantesJUG
Cassandra introduction @ NantesJUGDuyhai Doan
 
Apache Zeppelin @DevoxxFR 2016
Apache Zeppelin @DevoxxFR 2016Apache Zeppelin @DevoxxFR 2016
Apache Zeppelin @DevoxxFR 2016Duyhai Doan
 
Datastax day 2016 introduction to apache cassandra
Datastax day 2016   introduction to apache cassandraDatastax day 2016   introduction to apache cassandra
Datastax day 2016 introduction to apache cassandraDuyhai Doan
 
Cassandra introduction at FinishJUG
Cassandra introduction at FinishJUGCassandra introduction at FinishJUG
Cassandra introduction at FinishJUGDuyhai Doan
 
Cassandra nice use cases and worst anti patterns no sql-matters barcelona
Cassandra nice use cases and worst anti patterns no sql-matters barcelonaCassandra nice use cases and worst anti patterns no sql-matters barcelona
Cassandra nice use cases and worst anti patterns no sql-matters barcelonaDuyhai Doan
 
Data stax academy
Data stax academyData stax academy
Data stax academyDuyhai Doan
 
Libon cassandra summiteu2014
Libon cassandra summiteu2014Libon cassandra summiteu2014
Libon cassandra summiteu2014Duyhai Doan
 
Cassandra Anti-Patterns
Cassandra Anti-PatternsCassandra Anti-Patterns
Cassandra Anti-PatternsMatthew Dennis
 
Cassandra 3 new features @ Geecon Krakow 2016
Cassandra 3 new features  @ Geecon Krakow 2016Cassandra 3 new features  @ Geecon Krakow 2016
Cassandra 3 new features @ Geecon Krakow 2016Duyhai Doan
 
Apache zeppelin the missing component for the big data ecosystem
Apache zeppelin the missing component for the big data ecosystemApache zeppelin the missing component for the big data ecosystem
Apache zeppelin the missing component for the big data ecosystemDuyhai Doan
 
Cassandra for the ops dos and donts
Cassandra for the ops   dos and dontsCassandra for the ops   dos and donts
Cassandra for the ops dos and dontsDuyhai Doan
 

Viewers also liked (20)

Cassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patternsCassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patterns
 
Spark Cassandra 2016
Spark Cassandra 2016Spark Cassandra 2016
Spark Cassandra 2016
 
Introduction to KillrChat
Introduction to KillrChatIntroduction to KillrChat
Introduction to KillrChat
 
KillrChat Data Modeling
KillrChat Data ModelingKillrChat Data Modeling
KillrChat Data Modeling
 
KillrChat presentation
KillrChat presentationKillrChat presentation
KillrChat presentation
 
Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016
Sasi, cassandra on the full text search ride At  Voxxed Day Belgrade 2016Sasi, cassandra on the full text search ride At  Voxxed Day Belgrade 2016
Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016
 
Cassandra introduction @ ParisJUG
Cassandra introduction @ ParisJUGCassandra introduction @ ParisJUG
Cassandra introduction @ ParisJUG
 
Cassandra drivers and libraries
Cassandra drivers and librariesCassandra drivers and libraries
Cassandra drivers and libraries
 
Cassandra introduction mars jug
Cassandra introduction mars jugCassandra introduction mars jug
Cassandra introduction mars jug
 
Cassandra introduction @ NantesJUG
Cassandra introduction @ NantesJUGCassandra introduction @ NantesJUG
Cassandra introduction @ NantesJUG
 
Apache Zeppelin @DevoxxFR 2016
Apache Zeppelin @DevoxxFR 2016Apache Zeppelin @DevoxxFR 2016
Apache Zeppelin @DevoxxFR 2016
 
Datastax day 2016 introduction to apache cassandra
Datastax day 2016   introduction to apache cassandraDatastax day 2016   introduction to apache cassandra
Datastax day 2016 introduction to apache cassandra
 
Cassandra introduction at FinishJUG
Cassandra introduction at FinishJUGCassandra introduction at FinishJUG
Cassandra introduction at FinishJUG
 
Cassandra nice use cases and worst anti patterns no sql-matters barcelona
Cassandra nice use cases and worst anti patterns no sql-matters barcelonaCassandra nice use cases and worst anti patterns no sql-matters barcelona
Cassandra nice use cases and worst anti patterns no sql-matters barcelona
 
Data stax academy
Data stax academyData stax academy
Data stax academy
 
Libon cassandra summiteu2014
Libon cassandra summiteu2014Libon cassandra summiteu2014
Libon cassandra summiteu2014
 
Cassandra Anti-Patterns
Cassandra Anti-PatternsCassandra Anti-Patterns
Cassandra Anti-Patterns
 
Cassandra 3 new features @ Geecon Krakow 2016
Cassandra 3 new features  @ Geecon Krakow 2016Cassandra 3 new features  @ Geecon Krakow 2016
Cassandra 3 new features @ Geecon Krakow 2016
 
Apache zeppelin the missing component for the big data ecosystem
Apache zeppelin the missing component for the big data ecosystemApache zeppelin the missing component for the big data ecosystem
Apache zeppelin the missing component for the big data ecosystem
 
Cassandra for the ops dos and donts
Cassandra for the ops   dos and dontsCassandra for the ops   dos and donts
Cassandra for the ops dos and donts
 

Similar to Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

"Real-time data processing with Spark & Cassandra", jDays 2015 Speaker: "Duy-...
"Real-time data processing with Spark & Cassandra", jDays 2015 Speaker: "Duy-..."Real-time data processing with Spark & Cassandra", jDays 2015 Speaker: "Duy-...
"Real-time data processing with Spark & Cassandra", jDays 2015 Speaker: "Duy-...hamidsamadi
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014NoSQLmatters
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...BigDataEverywhere
 
Sparkling Water 5 28-14
Sparkling Water 5 28-14Sparkling Water 5 28-14
Sparkling Water 5 28-14Sri Ambati
 
Druid Adoption Tips and Tricks
Druid Adoption Tips and TricksDruid Adoption Tips and Tricks
Druid Adoption Tips and TricksImply
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Michael Rys
 
Spark zeppelin-cassandra at synchrotron
Spark zeppelin-cassandra at synchrotronSpark zeppelin-cassandra at synchrotron
Spark zeppelin-cassandra at synchrotronDuyhai Doan
 
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraDataStax Academy
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightPaco Nathan
 
Context-Aware Access Control for RDF Graph Stores
Context-Aware Access Control for RDF Graph StoresContext-Aware Access Control for RDF Graph Stores
Context-Aware Access Control for RDF Graph StoresSerena Villata
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightGert Drapers
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
 
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Data Con LA
 
Accelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with CascadingAccelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with CascadingCascading
 
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Helena Edelson
 

Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

  • 1. @doanduyhai Real time data processing with Spark & Cassandra DuyHai DOAN, Technical Advocate
  • 2. Who am I? Duy Hai DOAN, Cassandra technical advocate
      •  talks, meetups, confs
      •  open-source devs (Achilles, …)
      •  OSS Cassandra point of contact
      ☞ duy_hai.doan@datastax.com
      ☞ @doanduyhai
  • 3. Datastax
      •  Founded in April 2010
      •  We contribute a lot to Apache Cassandra™
      •  400+ customers (25 of the Fortune 100), 200+ employees
      •  Headquarters in the San Francisco Bay Area
      •  EU headquarters in London, offices in France and Germany
      •  Datastax Enterprise = OSS Cassandra + extra features
  • 4. Spark & Cassandra Integration
      •  Spark & its eco-system
      •  Cassandra & token ranges
      •  Stand-alone cluster deployment
  • 5. What is Apache Spark?
      Created at [logo]
      Open-sourced in 2010, Apache project since 2013
      General data processing framework
      MapReduce is not the A & Ω
      One-framework-many-components approach
  • 6. Spark characteristics
      Fast
      •  10x-100x faster than Hadoop MapReduce
      •  In-memory storage
      •  Single JVM process per node, multi-threaded
      Easy
      •  Rich Scala, Java and Python APIs (R is coming …)
      •  2x-5x less code
      •  Interactive shell
  • 7. Spark code example
      Setup
        import org.apache.spark.{SparkConf, SparkContext}

        // Local mode with 3 worker threads
        val conf = new SparkConf(true)
          .setAppName("basic_example")
          .setMaster("local[3]")

        val sc = new SparkContext(conf)
      Data-set (can be from text, CSV, JSON, Cassandra, HDFS, …)
        // (login, name, age)
        val people = List(("jdoe", "John DOE", 33),
                          ("hsue", "Helen SUE", 24),
                          ("rsmith", "Richard Smith", 33))
  • 8. RDDs
      RDD = Resilient Distributed Dataset
        import org.apache.spark.rdd.RDD

        val parallelPeople: RDD[(String, String, Int)] = sc.parallelize(people)

        // Key each tuple by its age (third field)
        val extractAge: RDD[(Int, (String, String, Int))] = parallelPeople
          .map(tuple => (tuple._3, tuple))

        val groupByAge: RDD[(Int, Iterable[(String, String, Int)])] = extractAge.groupByKey()

        // Count on extractAge: after groupByKey each age occurs exactly once,
        // so countByKey() on groupByAge would return 1 for every key
        val countByAge: scala.collection.Map[Int, Long] = extractAge.countByKey()
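
For readers without a Spark installation at hand, the same key-by-age, group, and count steps can be sketched with plain Scala collections. This is only an in-memory analogue of the RDD pipeline, not Spark code; the names `AgeCount` and `countByAge` are hypothetical and chosen for illustration.

```scala
object AgeCount {
  // Same sample data as on the slide: (login, name, age)
  val people: List[(String, String, Int)] = List(
    ("jdoe", "John DOE", 33),
    ("hsue", "Helen SUE", 24),
    ("rsmith", "Richard Smith", 33)
  )

  // Key each tuple by age, group on that key, then count the members of each group
  def countByAge(data: List[(String, String, Int)]): Map[Int, Long] =
    data
      .map(tuple => (tuple._3, tuple))   // plays the role of extractAge
      .groupBy(_._1)                     // plays the role of groupByKey
      .map { case (age, members) => age -> members.size.toLong }

  def main(args: Array[String]): Unit =
    println(countByAge(people))
}
```

With the three sample people, two share the age 33 and one has the age 24, which is the per-age count the RDD version is meant to produce.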
  • 9. RDDs
      RDD[A] = distributed collection of A
      •  RDD[Person]
      •  RDD[(String,Int)], …
      RDD[A] split into partitions
      Partitions distributed over n workers → parallel computing
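
The idea of splitting a collection into partitions that are processed independently can be sketched in plain Scala. This is a simplified model under stated assumptions: `PartitionSketch`, `partition` and `sumPerPartition` are hypothetical helpers, the chunking rule is not Spark's partitioner, and the partitions are processed sequentially here where Spark would hand them to different workers.

```scala
object PartitionSketch {
  // Split a collection into at most `numPartitions` contiguous chunks
  // (illustrative chunking rule, not Spark's partitioner)
  def partition[A](data: Seq[A], numPartitions: Int): Seq[Seq[A]] = {
    require(numPartitions > 0, "numPartitions must be positive")
    if (data.isEmpty) Seq.empty
    else {
      val chunkSize = math.ceil(data.size.toDouble / numPartitions).toInt
      data.grouped(chunkSize).toSeq
    }
  }

  // Each partition is reduced on its own; only the small partial results are combined afterwards
  def sumPerPartition(partitions: Seq[Seq[Int]]): Seq[Int] =
    partitions.map(_.sum)

  def main(args: Array[String]): Unit = {
    val parts = partition(1 to 10, 3)
    println(parts.map(_.toList))        // three chunks: 1..4, 5..8, 9..10
    println(sumPerPartition(parts).sum) // combined result over all partitions
  }
}
```

Because each partial sum depends only on its own chunk, the per-partition work could run on separate workers without coordination, which is the property the slide refers to.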
  • 10. Spark eco-system
      Components: Spark Streaming, MLlib, GraphX, Spark SQL
      Spark Core Engine (Scala/Java/Python)
      Cluster Manager: Local, Standalone cluster, YARN, Mesos
      Persistence: …
  • 12. What is Apache Cassandra?
      Created at [logo]
      Apache Project since 2009
      Distributed NoSQL database
      Eventual consistency (A & P of the CAP theorem)
      Distributed table abstraction
  • 13. Cassandra data distribution reminder
      Random: hash of #partition → token = hash(#p)
      Hash: ]-X, X]
      X = huge number (2^64/2)
      Ring: n1 n2 n3 n4 n5 n6 n7 n8
  • 14. Cassandra token ranges
      Murmur3 hash function
      A: ]0, X/8]
      B: ]X/8, 2X/8]
      C: ]2X/8, 3X/8]
      D: ]3X/8, 4X/8]
      E: ]4X/8, 5X/8]
      F: ]5X/8, 6X/8]
      G: ]6X/8, 7X/8]
      H: ]7X/8, X]
      Ring: n1 … n8, labelled with ranges A … H
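
The lookup from a token to the node owning its range can be sketched as follows. This is a minimal model, assuming range A belongs to n1, B to n2, and so on, and covering only the ]0, X] half of the token space shown in the range list; `TokenRing` and `ownerOf` are hypothetical names, not part of Cassandra or its drivers.

```scala
object TokenRing {
  // X is the upper bound of the token space; 2^63 corresponds to "2^64/2"
  val X: BigInt = BigInt(2).pow(63)

  // Assumption for illustration: node i owns the i-th range ](i-1)*X/8, i*X/8]
  val nodes: Vector[String] = Vector("n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8")

  // Find the node whose range ]lo, hi] contains the token
  def ownerOf(token: BigInt): String = {
    require(token > 0 && token <= X, "token must be in ]0, X]")
    val width = X / nodes.size
    // Lower bound exclusive, upper bound inclusive, hence (token - 1) / width
    val index = ((token - 1) / width).toInt
    nodes(index)
  }

  def main(args: Array[String]): Unit = {
    println(ownerOf(X / 8))     // upper bound of range A is included in A
    println(ownerOf(X / 8 + 1)) // first token of range B
    println(ownerOf(X))         // last token of range H
  }
}
```

The `(token - 1) / width` step encodes the half-open convention of the slide: each range excludes its lower bound and includes its upper bound.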
  • 17. Cassandra Query Language (CQL)
        INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);
        UPDATE users SET age = 34 WHERE login = 'jdoe';
        DELETE age FROM users WHERE login = 'jdoe';
        SELECT age FROM users WHERE login = 'jdoe';
  • 18. Why Spark on Cassandra?
      For Spark
      •  Reliable persistent store (HA)
      •  Structured data (Cassandra CQL → Dataframe API)
      •  Multi data-center !!!
  • 19. Why Spark on Cassandra?
      For Spark
      •  Reliable persistent store (HA)
      •  Structured data (Cassandra CQL → Dataframe API)
      •  Multi data-center !!!
      For Cassandra
      •  Cross-table operations (JOIN, UNION, etc.)
      •  Real-time/batch processing
      •  Complex analytics (e.g. machine learning)
  • 20. Use Cases
      •  Load data from various sources
      •  Analytics (join, aggregate, transform, …)
      •  Sanitize, validate, normalize data
      •  Schema migration, data conversion
  • 22. Cluster deployment
      Cassandra – Spark placement: 1 Cassandra process ⟷ 1 Spark worker
      Diagram labels: Driver Program, Spark Master, 4 × Spark Worker, 4 × Executor, 4 × C*
  • 23. Spark & Cassandra Connector
      •  Core API
      •  SparkSQL
      •  SparkStreaming
  • 24. Connector architecture
      •  All Cassandra types supported and converted to Scala types
      •  Server-side data filtering (SELECT … WHERE …)
      •  Uses the Java driver underneath
      •  Scala and Java support
  • 25. Connector architecture – Core API
      Cassandra tables exposed as Spark RDDs
      Read from and write to Cassandra
      Mapping of C* tables and rows to Scala objects
      •  CassandraRow
      •  Scala case class (object mapper)
      •  Scala tuples
  • 26. Connector architecture – Spark SQL
      Mapping of Cassandra table to SchemaRDD
      •  CassandraSQLRow → SparkRow
      •  custom query plan
      •  push predicates to CQL for early filtering
        SELECT * FROM user_emails WHERE login = 'jdoe';
  • 27. Connector architecture – Spark Streaming
      Streaming data INTO Cassandra table
      •  trivial setup
      •  be careful about your Cassandra data model !!!
      Streaming data OUT of Cassandra tables ?
      •  work in progress …
  • 28. Connector API
      •  Connector API
      •  Data Locality Implementation
  • 29. Connector API
      Connecting to Cassandra
        // Import Cassandra-specific functions on SparkContext and RDD objects
        import com.datastax.driver.spark._

        // Spark connection options
        val conf = new SparkConf(true)
          .setMaster("spark://192.168.123.10:7077")
          .setAppName("cassandra.demo")
          .set("cassandra.connection.host", "192.168.123.10") // initial contact
          .set("cassandra.username", "cassandra")
          .set("cassandra.password", "cassandra")

        val sc = new SparkContext(conf)
  • 30. Connector API
      Preparing test data
        CREATE TABLE test.words (word text PRIMARY KEY, count int);

        INSERT INTO test.words (word, count) VALUES ('bar', 30);
        INSERT INTO test.words (word, count) VALUES ('foo', 20);
  • 31. Connector API
      Reading from Cassandra
        // Use table as RDD
        val rdd = sc.cassandraTable("test", "words")
        // rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

        rdd.toArray.foreach(println)
        // CassandraRow[word: bar, count: 30]
        // CassandraRow[word: foo, count: 20]

        rdd.columnNames    // Stream(word, count)
        rdd.size           // 2

        val firstRow = rdd.first  // firstRow: CassandraRow = CassandraRow[word: bar, count: 30]

        firstRow.getInt("count")  // Int = 30
  • 32. Connector API
      Writing data to Cassandra
        val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
        // newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

        newRdd.saveToCassandra("test", "words", Seq("word", "count"))

        SELECT * FROM test.words;

         word | count
        ------+-------
          bar |    30
          foo |    20
          cat |    40
          fox |    50
  • 33. Remember token ranges?
      A: ]0, X/8]
      B: ]X/8, 2X/8]
      C: ]2X/8, 3X/8]
      D: ]3X/8, 4X/8]
      E: ]4X/8, 5X/8]
      F: ]5X/8, 6X/8]
      G: ]6X/8, 7X/8]
      H: ]7X/8, X]
      Ring: n1 … n8, labelled with ranges A … H
  • 36. Read data locality
      Diagram labels: Read from Cassandra; Spark shuffle operations
  • 37. Repartition before write
      Write to Cassandra
        rdd.repartitionByCassandraReplica("keyspace", "table")
  • 38. Or async batch writes
      Diagram labels: Async batches fan-out writes to Cassandra; Spark shuffle operations
  • 39. Write data locality
      •  either stream data with Spark using repartitionByCassandraReplica()
      •  or flush data to Cassandra by async batches
      •  in any case, there will be data movement on the network (sorry, no magic)
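
The first option, regrouping rows by the node that owns their partition key before writing, can be pictured with a small plain-Scala model. This is a conceptual sketch only and not how the connector implements `repartitionByCassandraReplica()`: the 32-bit `MurmurHash3` from the Scala standard library stands in for Cassandra's partitioner, and `ReplicaGrouping`, `nodeFor` and `groupByNode` are hypothetical names.

```scala
import scala.util.hashing.MurmurHash3

object ReplicaGrouping {
  // Hypothetical four-node cluster
  val nodes: Vector[String] = Vector("n1", "n2", "n3", "n4")

  // Toy stand-in for the partitioner: hash the partition key and fold it onto the node list
  def nodeFor(partitionKey: String): String = {
    val hash = MurmurHash3.stringHash(partitionKey)
    nodes(java.lang.Math.floorMod(hash, nodes.size))
  }

  // Group rows by owning node so that each group can be written to its local replica
  def groupByNode[V](rows: Seq[(String, V)]): Map[String, Seq[(String, V)]] =
    rows.groupBy { case (key, _) => nodeFor(key) }

  def main(args: Array[String]): Unit = {
    val rows = Seq(("cat", 40), ("fox", 50), ("bar", 30), ("foo", 20))
    groupByNode(rows).foreach { case (node, group) => println(s"$node <- $group") }
  }
}
```

Moving each row into the group of its owning node is itself a shuffle, which is the data movement the last bullet of the slide warns about.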
  • 40. Joins with data locality
        CREATE TABLE artists(name text, style text, … PRIMARY KEY(name));
        CREATE TABLE albums(title text, artist text, year int, … PRIMARY KEY(title));

        val join: CassandraJoinRDD[(String,Int), (String,String)] =
          sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
            // Select only useful columns for join and processing
            .select("artist","year")
            .as((_:String, _:Int))
            // Repartition RDDs by "artists" PK, which is "name"
            .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
            // Join with "artists" table, selecting only "name" and "country" columns
            .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
            .on(SomeColumns("name"))
  • 41. Joins pipeline with data locality
        val join: CassandraJoinRDD[(String,Int), (String,String)] =
          sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
            // Select only useful columns for join and processing
            .select("artist","year")
            .as((_:String, _:Int))
            .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
            .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
            .on(SomeColumns("name"))
            .map(…)
            .filter(…)
            .groupByKey()
            .mapValues(…)
            .repartitionByCassandraReplica(KEYSPACE, ARTISTS_RATINGS)
            .joinWithCassandraTable(KEYSPACE, ARTISTS_RATINGS)
            …
  • 42. Perfect data locality scenario
      •  read locally from Cassandra
      •  use operations that do not require shuffle in Spark (map, filter, …)
      •  repartitionByCassandraReplica()
      •  → to a table having the same partition key as the original table
      •  save back into this Cassandra table
  • 44. What does the future hold?
      Datastax Enterprise 4.7
      •  Cassandra + Spark + Solr as your analytics platform
      Filter out as much data as possible with Solr from Cassandra
      Fetch the filtered data in Spark and perform aggregations
      Save the final data back into Cassandra
  • 45. What does the future hold?
      What about data locality?
  • 46. What does the future hold?
        val join: CassandraJoinRDD[(String,Int), (String,String)] =
          sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
            // Select only useful columns for join and processing
            .select("artist","year").where("solr_query = 'style:*rock* AND ratings:[3 TO *]' ")
            .as((_:String, _:Int))
            .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
            .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
            .on(SomeColumns("name")).where("solr_query = 'age:[20 TO 30]' ")
      1.  compute Spark partitions using Cassandra token ranges
      2.  on each partition, use Solr for local data filtering (no fan out !)
      3.  fetch data back into Spark for aggregations
      4.  repeat 1 – 3 as many times as necessary
  • 47. What does the future hold?
        SELECT … FROM … WHERE token(#partition) > 3X/8 AND token(#partition) <= 4X/8 AND solr_query = 'full text search expression';
      Token range D: ]3X/8, 4X/8]
      Advantages of same-JVM Cassandra + Solr integration
      1.  Single-pass local full-text search (no fan out)
      2.  Data retrieval
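
Generating one such token-restricted statement per range can be sketched in plain Scala. This is an illustration under stated assumptions, not connector or DSE code: the table name, the partition key column and the use of `SELECT *` are placeholders for the parts the slide elides, and `TokenRangeQueries`, `ranges` and `queryFor` are hypothetical names.

```scala
object TokenRangeQueries {
  // Split ]0, x] into n equal ranges ]lo, hi], as in the A..H example
  def ranges(x: BigInt, n: Int): Seq[(BigInt, BigInt)] =
    (0 until n).map(i => (x * i / n, x * (i + 1) / n))

  // Build one CQL statement restricted to a token range and filtered by a Solr expression.
  // Plain string interpolation is used for readability only.
  def queryFor(table: String, partitionKey: String,
               lo: BigInt, hi: BigInt, solrQuery: String): String =
    s"SELECT * FROM $table WHERE token($partitionKey) > $lo " +
      s"AND token($partitionKey) <= $hi AND solr_query = '$solrQuery'"

  def main(args: Array[String]): Unit = {
    // Small X so that the generated bounds stay readable
    val x = BigInt(800)
    ranges(x, 8).foreach { case (lo, hi) =>
      println(queryFor("albums", "title", lo, hi, "style:*rock*"))
    }
  }
}
```

Each generated statement touches a single token range, so it can be sent to a node owning that range and filtered there, which matches steps 1 and 2 of the procedure on the previous slide.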
  • 48. Q & A