PySpark Cassandra
Analytics with Cassandra and PySpark
Frens Jan Rumph
• Database and processing architect
at Target Holding
• Contact me at:
frens.jan.rumph@target-holding.nl
Target Holding
• Machine Learning Company
• Time series:
– Prediction, Search,
Anomaly detection, ...
• Text:
– Search, matching (e.g. jobs
and resumes), ...
• Markets:
– media
– human resources
– infrastructure
(energy, waterworks, ...)
– health
PySpark Cassandra
• Technology background
– Cassandra, Spark and PySpark
• PySpark Cassandra
– Introduction
– Features and use cases
– Getting started
– Operators and examples
Technology background
Cassandra, Spark and PySpark
Cassandra
• Distributed database
• Originated at
Facebook
• Roots in Amazon
Dynamo
Cassandra Query Language
Main 'user interface' of Cassandra
with a SQL feel (tables with rows); small example below
• DML
– Insert into ..., Select from ...,
Update ..., Delete from ...
• DDL
– Create table ..., Create index ...
• Column types:
– Numbers, strings, etc.,
– Collections (lists, sets and maps)
– Counters
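A minimal illustration using the DataStax Python driver (not part of PySpark Cassandra); the contact point, keyspace, table and values are made up:

from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(['cas1'])  # contact point is an assumption
session = cluster.connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo WITH replication =
        {'class': 'SimpleStrategy', 'replication_factor': 1}""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.events (
        key text, stamp timestamp, val double, tags list<text>,
        PRIMARY KEY (key, stamp))""")
session.execute(
    "INSERT INTO demo.events (key, stamp, val) VALUES (%s, %s, %s)",
    ('x', datetime.now(), 1.0))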
Distribution and replication
• Distributed map of
ordered maps
– some changes under the
hood in C* 3
• Consistent hashing
• Replication along ring
– keys usually 'placed on the
ring' through hashing (toy sketch below)
Image by DataStax
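A toy sketch of the placement idea (C* actually uses Murmur3 tokens and vnodes; this only illustrates consistent hashing and replication along the ring):

import bisect, hashlib

def token(key):
    # stand-in for Murmur3: map a key to a position on the ring
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 2**64

ring = sorted((token('node-%d' % i), 'node-%d' % i) for i in range(6))
tokens = [t for t, _ in ring]

def replicas(key, rf=3):
    # the first rf nodes 'clockwise' from the key's token
    i = bisect.bisect(tokens, token(key))
    return [ring[(i + n) % len(ring)][1] for n in range(rf)]

print(replicas('some-partition-key'))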
Local data structures and 2i
• Memtables
• SSTables
– Ordered within partition on
clustering columns
• Various caches
• Various indices
• Materialized views
– manually < C* 3.0
– similar to normal tables
• Secondary indices
– scatter gather model
– 'normal'
– or 'search'
Spark
• Distributed data
processing engine
• 'Doesn't touch disk if
it doesn't have to'
• Layers on top of data
sources
– HDFS, Cassandra,
Elasticsearch, JDBC, ...
Resilient Distributed Dataset
• Partitioned (and distributed) collection of rows
• Part of computational graph
– RDD has linkage to 'source' RDD(s)
• DataFrame / DataSet: stronger typing and
declarative querying layered on top
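A minimal sketch of partitioning and lineage, assuming a SparkContext sc:

rdd = sc.parallelize(range(10), 4)  # 4 partitions
doubled = rdd.map(lambda x: x * 2)  # new RDD, linked to its source
print(doubled.toDebugString())      # prints the lineage graph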
Transformations and actions
• Narrow transformations
are 'data-local'
– map, filter, ...
• Wide transformations
aren't
– join, sort, reduce, group, ...
• Actions to
– read results to the driver
– write results to disk,
database, ...
Image by Apache
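For example, assuming a SparkContext sc:

pairs = sc.parallelize([('a', 1), ('b', 2), ('a', 3)])
bumped = pairs.mapValues(lambda v: v + 1)        # narrow: stays data-local
totals = bumped.reduceByKey(lambda a, b: a + b)  # wide: shuffles between partitions
print(totals.collect())                          # action: results to the driver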
Distribution / topology
• Driver: where your app 'lives'
• Cluster manager (standalone, Mesos or YARN):
coordinates resources
• Workers / executors: where the work happens
Image by Apache
PySpark
• Wrapper around Java APIs
– JVM for data shipping when working with Python RDDs
– 'Query language' when working with DataFrames
• CPython interpreters as (extra) executors
– essentially the multiprocessing model
– but distributed
– CPython executors forked per job (not per application)
Pickle
• Object serialization shipped with Python
• Pickle used for messaging between
CPython interpreters and JVM
• cPickle / cloudpickle in CPython
• Py4J in the JVM
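A small sketch of why cloudpickle is used: plain pickle refuses the functions you pass to map and friends (cloudpickle is bundled with PySpark):

import pickle
from pyspark import cloudpickle  # bundled with PySpark

f = lambda x: x + 1
try:
    pickle.dumps(f)  # plain pickle can't serialize lambdas
except Exception as e:
    print(e)
g = pickle.loads(cloudpickle.dumps(f))  # cloudpickle handles lambdas and closures
print(g(41))  # 42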
PySpark Cassandra
PySpark Cassandra
• Developed at Target Holding
– use a lot of python
– and Cassandra
– Spark option for processing
• Built on the Spark Cassandra Connector
– Datastax provides Spark Cassandra Connector
– Python + Cassandra link was missing
– PySpark Cassandra
Features and use cases
PySpark Cassandra
Features
• Distributed C* table scanning into RDDs
• Writing RDDs and DStreams to C*
• Joining RDDs and DStreams with C* tables
Use cases
• Perform bulk 'queries' you normally can't
(C* doesn't do group by or join), that would
otherwise take a prohibitive amount of time,
or just because it's easy once it's set up
(see the sketch after this list)
• Data wrangling ('cooking' features, etc.)
• As a streaming processing platform
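A hedged sketch of such a bulk group-by (keyspace, table and columns are made up):

from pyspark_cassandra import RowFormat

totals = (sc.cassandraTable('shop', 'purchases', row_format=RowFormat.DICT)
          .map(lambda r: (r['user_id'], r['amount']))
          .reduceByKey(lambda a, b: a + b))
totals.saveToCassandra('shop', 'purchase_totals',  # hypothetical target table
                       columns=('user_id', 'amount'))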
Use cases at Target Holding
PySpark Cassandra
Media metadata processing
• We disambiguated close to 90 million
'authorships' using names,
contact details, publication
keywords
– in order to build analytical
applications for a large publisher of
scientific journals
• Spark / PySpark Cassandra for data wrangling
Earthquake monitoring
• We are building a monitoring system
for use with many low-cost vibration sensors
• Spark / PySpark Cassandra for
– Enriching the event stream
– Saving the event stream
– Bulk processing
– Anomaly detection
Processing time series data
• We collect time series data in various fields
– Energy (electricity and gas usage)
– Music (tracking online music and video portals)
• Spark / PySpark Cassandra for
– data wrangling
– rolling up data
– bulk forecasting
– anomaly detection
Getting started
PySpark Cassandra
Getting started
• 'Homepage': github.com/TargetHolding/pyspark-cassandra
• Available on: spark-packages.org/package/TargetHolding/pyspark-cassandra
• Also read: github.com/datastax/spark-cassandra-connector
Compatibility
• Spark 1.5 and 1.6
– (supported older versions in the past)
• Cassandra 2.1.5, 2.2 and 3
• Python 2.7 and 3
High-level overview
• Read from and write to C* using Spark as a
colocated distributed processing platform
Image by DataStax
Software setup
• CPython side: your application, the Python part of
PySpark Cassandra, PySpark
• JVM side: the Scala part of PySpark Cassandra, the
Datastax Spark Cassandra Connector, Spark
• Cassandra itself runs in its own JVM
Submit script
spark-submit
--packages TargetHolding/pyspark-cassandra:0.3.5
--conf spark.cassandra.connection.host=cas1,cas2,cas3
--master spark://spark-master:7077
yourscript.py
from pyspark import SparkConf
from pyspark_cassandra import CassandraSparkContext

conf = SparkConf()
sc = CassandraSparkContext(conf=conf)
# your script
PySpark shell
IPYTHON_OPTS=notebook
PYSPARK_DRIVER_PYTHON=ipython
pyspark
--packages TargetHolding/pyspark-cassandra:0.3.5
--conf ...
...
import pyspark_cassandra
Operators and examples
PySpark Cassandra
Operators
• Scan
• Project (select)
• Filter (where)
• Limit, etc.
• Count
• 'Spanning'
• Join
• Save
Scan
• cassandraTable() scans C*
– Determine basic token ranges
– Group them into partitions
• taking size and location into account
– Execute (concurrent) CQL queries to C*
rows = sc.cassandraTable('keyspace', 'table')
Scan
• Basically executing this query many times:
SELECT columns
FROM keyspace.table
WHERE
token(pk) > ? and
token(pk) < ?
filter
ORDER BY ...
LIMIT ...
ALLOW FILTERING
Scan
• Quite tunable if necessary
sc.cassandraTable('keyspace', 'table',
row_format=RowFormat.DICT, # ROW, DICT or TUPLE
split_count=1000, # number of partitions (splits)
split_size=100000, # size of a partition
fetch_size=1000, # query page size
consistency_level='ALL',
metrics_enabled=True
)
Project / Select
• To make things go a little faster,
select only the columns you need.
– This saves in communication:
C* ↔ Spark JVM ↔ CPython
sc.cassandraTable(...).select('col1', 'col2', ...)
Types
CQL        Python
ascii      unicode string
bigint     long
blob       bytearray
boolean    boolean
counter    int, long
decimal    decimal
double     float
float      float
inet       str
int        int
set        set
list       list
text       unicode string
timestamp  datetime.datetime
timeuuid   uuid.UUID
varchar    unicode string
varint     long
uuid       uuid.UUID
UDT        pyspark_cassandra.UDT
Key by primary key
• Cassandra RDDs can be keyed by primary key
– yielding an RDD of key-value pairs
– keying by partition key not yet supported
sc.cassandraTable(...).by_primary_key()
Filter / where
• Clauses on primary keys, clustering columns or
secondary indices can be pushed down
– if a WHERE with ALLOW FILTERING works in CQL
• Otherwise resort to RDD.filter or DF.filter
from datetime import datetime, timedelta

sc.cassandraTable(...).where(
'col2 > ?', datetime.now() - timedelta(days=14)
)
Combine with 2i
• With the Cassandra Lucene index
sc.cassandraTable(...).where('lucene = ?', '''{
filter : {
field: "loc",
type: "geo_bbox",
min_latitude: 53.217, min_longitude: 6.521,
max_latitude: 53.219, max_longitude: 6.523
}
}''')
Limit, take and first
• limit() the number of rows per query
– there are at least as many queries as there are token ranges
• take(n) at most n rows from the RDD
– applies limit to make it just a tad faster
sc.cassandraTable(...).limit(1)...
sc.cassandraTable(...).take(3)
sc.cassandraTable(...).first()
Push down count
• cassandraCount() pushes count(*)
queries down to C*
– counts per partition, then reduces
• When all you want to do is count records in C*
– doesn't force caching
sc.cassandraTable(...).cassandraCount()
Spanning
• Wide rows in C* are retrieved in order; they are
consecutive and don't cross partition boundaries
• spanBy() is like groupBy() for wide rows
sc.cassandraTable(...).spanBy('doc_id')
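A hedged usage sketch: each element is a (span key, rows) pair, so a whole wide row can be processed at once:

spans = sc.cassandraTable('keyspace', 'table').spanBy('doc_id')
lengths = spans.map(lambda kv: (kv[0], sum(1 for _ in kv[1])))
print(lengths.take(3))  # [(doc_id, number of rows in that span), ...]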
Save
• Save any PySpark RDD to C*
– as long as it consists of dicts, tuples or Rows
rdd.saveToCassandra('keyspace', 'table', ...)
Save
from datetime import datetime
from random import random

rows = [dict(
key = k,
stamp = datetime.now(),
val = random() * 10,
tags = ['a', 'b', 'c'],
options = dict(foo='bar', baz='qux'),
) for k in ('x', 'y', 'z')]
rdd = sc.parallelize(rows)
rdd.saveToCassandra('keyspace', 'table')
Save
rdd.saveToCassandra(...,
# columns to save / how to interpret
# the elements in a tuple
columns = ('col1', 'col2'),
row_format = RowFormat.DICT, # RDD format hint
keyed = True, # whether the RDD consists of key-value pairs
)
Save
rdd.saveToCassandra(...,
batch_size = 16*1024, # max. size of a batch
batch_buffer_size = 1000, # max. pending batches
batch_grouping_key = "partition", # how batches are formed:
# any / replicaset / partition
consistency_level = "LOCAL_ONE",
parallelism_level = 8, # max. batches in flight
throughput_mibps = MAX_LONG, # max. MB/s
ttl = timedelta(days=3), # TTL
metrics_enabled = False
)
Save DStream
• Just like saving an RDD!
• But then for every micro-batch in the DStream
(dstream
.map(lambda e: e[1])
.filter(lambda v: v > 3)
.saveToCassandra('keyspace', 'table', ...))
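Putting it together, a minimal sketch (the socket source and target table are made up; assumes importing pyspark_cassandra patches DStream like it patches RDD):

import pyspark_cassandra  # assumption: patches DStream with saveToCassandra
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)                   # 10s micro-batches
words = ssc.socketTextStream('localhost', 9999)  # made-up source
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.saveToCassandra('keyspace', 'word_counts',  # hypothetical table
                       columns=('word', 'count'))
ssc.start()
ssc.awaitTermination()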
Join
• Join (inner) any RDD with a C* table
• No outer joins supported
(rdd.joinWithCassandraTable('keyspace', 'table')
.on('id')
.select('col1', 'col2'))
Join
• Query per PK in left RDD
for row(s) in joined table
• Somewhat similar to hash join
• Usual caveats of skew, 'shuffle' overhead, etc.
• (repartitionByCassandraReplica not yet supported)
Future work
PySpark Cassandra
Future work
• Main engine:
– Build on DF/S more?
– Build on native python cassandra driver?
– Use marshal or ... instead of pickle?
• More features:
– repartitioning
– multi cluster
– expose C* session to python
• Suggestions?
We're hiring
http://www.target-holding.nl/organisatie/vacatures
data scientists, backend and database
engineers, web developers, ...
Q&A