PySpark Cassandra - Amsterdam Spark Meetup

At an Amsterdam Spark Meetup I gave a talk about how to work with Spark and Cassandra from Python.

1. PySpark Cassandra
   Analytics with Cassandra and PySpark
2. Frens Jan Rumph
   • Database and processing architect at Target Holding
   • Contact me at: frens.jan.rumph@target-holding.nl
3. Target Holding
   • Machine learning company
   • Time series:
     – Prediction, search, anomaly detection, ...
   • Text:
     – Search, matching (e.g. jobs and resumes), ...
   • Markets:
     – Media
     – Human resources
     – Infrastructure (energy, waterworks, ...)
     – Health
4. PySpark Cassandra
   • Technology background
     – Cassandra, Spark and PySpark
   • PySpark Cassandra
     – Introduction
     – Features and use cases
     – Getting started
     – Operators and examples
5. Technology background
   Cassandra, Spark and PySpark
6. Cassandra
   • Distributed database
   • Originated at Facebook
   • Roots in Amazon Dynamo
7. Cassandra Query Language
   Main 'user interface' of Cassandra, with a SQL feel (tables with rows)
   • DML
     – Insert into ..., Select from ..., Update ..., Delete from ...
   • DDL
     – Create table ..., Create index ...
   • Column types:
     – Numbers, strings, etc.
     – Collections (lists, sets and maps)
     – Counters
8. Distribution and replication
   • Distributed map of ordered maps
     – Under the hood; some updates in C* 3
   • Consistent hashing (sketch below)
   • Replication along the ring
     – Keys are usually 'placed on the ring' through hashing
   (Image by DataStax)
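The consistent hashing and 'placement on the ring' idea can be sketched in a few lines of Python. This is a toy illustration only: Cassandra actually uses the Murmur3 partitioner and virtual nodes, and the node names and replication factor here are made up.

   import bisect, hashlib

   class Ring(object):
       """Toy token ring: a key is hashed to a token and the next
       `rf` nodes clockwise on the ring hold its replicas."""

       def __init__(self, nodes, rf=3):
           self.rf = rf
           # place the nodes on the ring by hashing their names (illustrative only)
           self.ring = sorted((self._token(n), n) for n in nodes)
           self.tokens = [t for t, _ in self.ring]

       @staticmethod
       def _token(key):
           # stand-in for Cassandra's partitioner (Murmur3 in practice)
           return int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16)

       def replicas(self, partition_key):
           i = bisect.bisect(self.tokens, self._token(partition_key))
           return [self.ring[(i + j) % len(self.ring)][1] for j in range(self.rf)]

   ring = Ring(['cas1', 'cas2', 'cas3', 'cas4', 'cas5'])
   print(ring.replicas('some-partition-key'))   # e.g. ['cas4', 'cas1', 'cas5']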
9. Local data structures and 2i
   • Memtables
   • SSTables
     – Ordered within a partition on clustering columns
   • Various caches
   • Various indices
   • Materialized views
     – Manual before C* 3.0
     – Similar to normal tables
   • Secondary indices
     – Scatter-gather model
     – 'Normal' or 'search'
10. Spark
   • Distributed data processing engine
   • 'Doesn't touch disk if it doesn't have to'
   • Layers on top of data sources
     – HDFS, Cassandra, Elasticsearch, JDBC, ...
11. Resilient Distributed Dataset
   • Partitioned (and distributed) collection of rows
   • Part of a computational graph
     – An RDD has lineage to its 'source' RDD(s)
   • DataFrame and Dataset layered on top, with stronger typing and declarative querying
12. Transformations and actions
   • Narrow transformations are 'data-local'
     – map, filter, ...
   • Wide transformations aren't (example below)
     – join, sort, reduce, group, ...
   • Actions
     – Read results back to the driver
     – Write results to disk, a database, ...
   (Image by Apache)
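A minimal PySpark illustration of the three categories above; the data is made up:

   rdd = sc.parallelize([('a', 1), ('b', 2), ('a', 3)])

   # narrow transformations: each output partition depends on a single input partition
   doubled = rdd.mapValues(lambda v: v * 2)
   positive = doubled.filter(lambda kv: kv[1] > 0)

   # wide transformation: needs a shuffle across partitions
   per_key = positive.reduceByKey(lambda a, b: a + b)

   # action: pulls the result back to the driver
   print(per_key.collect())   # [('a', 8), ('b', 4)] (in some order)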
13. Distribution / topology
   • Driver: where your app 'lives'
   • Cluster manager (standalone, Mesos or YARN): coordinates resources
   • Workers: where the work happens
   (Image by Apache)
14. PySpark
   • Wrapper around the Java APIs
     – JVM does the data shipping when working with Python RDDs
     – Acts as a 'query language' when working with DataFrames
   • CPython interpreters as (extra) executors
     – Essentially the multiprocessing model, but distributed
     – CPython executors are forked per job (not per application)
15. Pickle
   • Object serialization shipped with Python
   • Pickle is used for messaging between the CPython interpreters and the JVM (sketch below)
   • cPickle / cloudpickle on the CPython side
   • Py4J on the JVM side
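To get a feel for the cost of this, each row crossing the JVM ↔ CPython boundary goes through a pickle round trip roughly like the one below. This is a plain pickle example, not PySpark's actual internal code path.

   import pickle
   from datetime import datetime

   row = {'key': 'x', 'stamp': datetime.now(), 'val': 4.2, 'tags': ['a', 'b']}

   payload = pickle.dumps(row, protocol=2)   # bytes shipped between JVM and CPython
   restored = pickle.loads(payload)          # back to a Python object on the other side

   assert restored['key'] == 'x'
   print('%d bytes on the wire for one row' % len(payload))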
16. PySpark Cassandra
17. PySpark Cassandra
   • Developed at Target Holding
     – We use a lot of Python
     – And Cassandra
     – Spark as an option for processing
   • Built on the Spark Cassandra Connector
     – DataStax provides the Spark Cassandra Connector
     – The Python + Cassandra link was missing
     – Hence PySpark Cassandra
18. Features and use cases
    PySpark Cassandra
19. Features
   • Distributed C* table scanning into RDDs
   • Writing RDDs and DStreams to C*
   • Joining RDDs and DStreams with C* tables
20. Use cases
   • Perform bulk 'queries' you normally can't (example below)
     – C* doesn't do group by or join, or it would take a prohibitive amount of time
     – Or just because it's easy once it's set up
   • Data wrangling ('cooking' features, etc.)
   • As a stream processing platform
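As a sketch of the first use case, a 'group by' that C* can't do itself can be expressed as a scan plus a Spark-side aggregation. The keyspace, table and column names below are hypothetical, and RowFormat is assumed to be importable from pyspark_cassandra as used on the scan slide further on.

   from pyspark_cassandra import RowFormat   # import path assumed

   # hypothetical table events(sensor_id, stamp, value): total the values per sensor
   totals = (sc.cassandraTable('monitoring', 'events', row_format=RowFormat.DICT)
               .select('sensor_id', 'value')
               .map(lambda row: (row['sensor_id'], row['value']))
               .reduceByKey(lambda a, b: a + b)               # the 'group by' C* doesn't do
               .map(lambda kv: dict(sensor_id=kv[0], total=kv[1])))

   totals.saveToCassandra('monitoring', 'totals_per_sensor')  # hypothetical result table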
21. Use cases at Target Holding
    PySpark Cassandra
22. Media metadata processing
   • We disambiguated close to 90 million 'authorships' using names, contact details and publication keywords
     – In order to build analytical applications for a large publisher of scientific journals
   • Spark / PySpark Cassandra for data wrangling
23. Earthquake monitoring
   • We are building a monitoring system for use with many low-cost vibration sensors
   • Spark / PySpark Cassandra for
     – Enriching the event stream
     – Saving the event stream
     – Bulk processing
     – Anomaly detection
24. Processing time series data
   • We collect time series data in various fields
     – Energy (electricity and gas usage)
     – Music (tracking online music and video portals)
   • Spark / PySpark Cassandra for
     – Data wrangling
     – Rolling up data
     – Bulk forecasting
     – Anomaly detection
25. Getting started
    PySpark Cassandra
26. Getting started
   • 'Homepage': github.com/TargetHolding/pyspark-cassandra
   • Available on: spark-packages.org/package/TargetHolding/pyspark-cassandra
   • Also read: github.com/datastax/spark-cassandra-connector
27. Compatibility
   • Spark 1.5 and 1.6
     – (supported older versions in the past)
   • Cassandra 2.1.5, 2.2 and 3
   • Python 2.7 and 3
28. High-level overview
   • Read from and write to C* using Spark as a co-located distributed processing platform
   (Image by DataStax)
29. Software setup
   • Application code runs in CPython on top of the Python part of PySpark Cassandra and PySpark
   • On the JVM side: the Scala part of PySpark Cassandra, the DataStax Spark Cassandra Connector and Spark
   • Spark's JVMs talk to the Cassandra JVMs
   (Diagram of the layered software stack)
30. Submit script

   spark-submit
     --packages TargetHolding/pyspark-cassandra:0.3.5
     --conf spark.cassandra.connection.host=cas1,cas2,cas3
     --master spark://spark-master:7077
     yourscript.py

   # yourscript.py
   import ...
   conf = SparkConf()
   sc = CassandraSparkContext(conf=conf)
   # your script
31. PySpark shell

   IPYTHON_OPTS=notebook
   PYSPARK_DRIVER_PYTHON=ipython
   pyspark
     --packages TargetHolding/pyspark-cassandra:0.3.5
     --conf ...

   ...
   import pyspark_cassandra
32. Operators and examples
    PySpark Cassandra
33. Operators
   • Scan
   • Project (select)
   • Filter (where)
   • Limit, etc.
   • Count
   • 'Spanning'
   • Join
   • Save
34. Scan
   • cassandraTable() to scan C*
     – Determine the basic token ranges
     – Group them into partitions, taking size and location into account
     – Execute (concurrent) CQL queries against C*

   rows = sc.cassandraTable('keyspace', 'table')
35. Scan
   • Basically executing this query many times:

   SELECT columns
   FROM keyspace.table
   WHERE
     token(pk) > ? and
     token(pk) < ?
     filter
   ORDER BY ...
   LIMIT ...
   ALLOW FILTERING
36. Scan
   • Quite tunable if necessary

   sc.cassandraTable('keyspace', 'table',
       row_format=RowFormat.DICT,  # ROW, DICT or TUPLE
       split_count=1000,           # number of partitions (splits)
       split_size=100000,          # size of a partition
       fetch_size=1000,            # query page size
       consistency_level='ALL',
       metrics_enabled=True
   )
37. Project / Select
   • To make things go a little faster, select only the columns you need
     – This saves on communication: C* ↔ Spark JVM ↔ CPython

   sc.cassandraTable(...).select('col1', 'col2', ...)
38. Types

   CQL         Python
   ascii       unicode string
   bigint      long
   blob        bytearray
   boolean     boolean
   counter     int, long
   decimal     decimal
   double      float
   float       float
   inet        str
   int         int
   set         set
   list        list
   text        unicode string
   timestamp   datetime.datetime
   timeuuid    uuid.UUID
   varchar     unicode string
   varint      long
   uuid        uuid.UUID
   UDT         pyspark_cassandra.UDT
39. Key by primary key
   • Cassandra RDDs can be keyed by primary key
     – Yielding an RDD of key-value pairs
     – Keying by partition key is not yet supported

   sc.cassandraTable(...).by_primary_key()
40. Filter / where
   • Clauses on primary keys, clustering columns or secondary indices can be pushed down
     – If a where with allow filtering works in CQL
   • Otherwise resort to RDD.filter or DF.filter (sketch below)

   sc.cassandraTable(...).where(
       'col2 > ?', datetime.now() - timedelta(days=14)
   )
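A small sketch of the fallback mentioned above, for a predicate that cannot be pushed down; the column names are made up and RowFormat is used as on the scan slide.

   from datetime import datetime, timedelta

   # pushed down to C* (assumes col2 is a clustering column or indexed)
   recent = sc.cassandraTable('keyspace', 'table').where(
       'col2 > ?', datetime.now() - timedelta(days=14))

   # not pushable (predicate on a non-key, non-indexed column):
   # scan and filter on the Spark side instead
   tagged = (sc.cassandraTable('keyspace', 'table', row_format=RowFormat.DICT)
               .filter(lambda row: 'urgent' in row['tags']))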
41. Combine with 2i
   • With the Cassandra Lucene index

   sc.cassandraTable(...).where('lucene = ?', '''{
       filter: {
           field: "loc", type: "geo_bbox",
           min_latitude: 53.217, min_longitude: 6.521,
           max_latitude: 53.219, max_longitude: 6.523
       }
   }''')
42. Limit, take and first
   • limit() the number of rows per query
     – There are at least as many queries as there are token ranges
   • take(n) at most n rows from the RDD
     – Applies limit to make it just a tad faster

   sc.cassandraTable(...).limit(1)...
   sc.cassandraTable(...).take(3)
   sc.cassandraTable(...).first()
43. Push down count
   • cassandraCount() pushes count(*) queries down to C*
     – Counts within partitions, then reduces
   • When all you want to do is count records in C*
     – Doesn't force caching

   sc.cassandraTable(...).cassandraCount()
44. Spanning
   • Wide rows in C* are retrieved in order, are consecutive and don't cross partition boundaries
   • spanBy() is like groupBy() for wide rows (sketch below)

   sc.cassandraTable(...).spanBy('doc_id')
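Assuming a wide-row table keyed by doc_id, the spans can then be consumed much like the output of groupBy(); the table and column names are made up.

   # each element is (doc_id, iterable over that document's rows, in clustering order)
   rows_per_doc = (sc.cassandraTable('keyspace', 'documents')
                     .spanBy('doc_id')
                     .map(lambda span: (span[0], sum(1 for _ in span[1]))))

   print(rows_per_doc.take(5))   # e.g. [(doc_id, number_of_rows), ...]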
45. Save
   • Save any PySpark RDD to C*
     – As long as it consists of dicts, tuples or Rows

   rdd.saveToCassandra('keyspace', 'table', ...)
46. Save

   rows = [dict(
       key = k,
       stamp = datetime.now(),
       val = random() * 10,
       tags = ['a', 'b', 'c'],
       options = dict(foo='bar', baz='qux'),
   ) for k in ('x', 'y', 'z')]

   rdd = sc.parallelize(rows)
   rdd.saveToCassandra('keyspace', 'table')
47. Save

   rdd.saveToCassandra(...,
       columns = ('col1', 'col2'),   # The columns to save / how to interpret the elements in a tuple
       row_format = RowFormat.DICT,  # RDD format hint
       keyed = True,                 # Whether the RDD consists of key-value pairs
   )
48. Save

   rdd.saveToCassandra(...,
       batch_size = 16*1024,              # Max. size of a batch
       batch_buffer_size = 1000,          # Max. pending batches
       batch_grouping_key = "partition",  # How batches are formed: any / replicaset / partition
       consistency_level = "LOCAL_ONE",
       parallelism_level = 8,             # Max. batches in flight
       throughput_mibps = MAX_LONG,       # Max. MB/s
       ttl = timedelta(days=3),           # TTL
       metrics_enabled = False
   )
49. Save DStream
   • Just like saving an RDD!
   • But then for every micro-batch in the DStream

   dstream
     .map(lambda e: e[1])
     .filter(lambda v: v > 3)
     .saveToCassandra('keyspace', 'table', ...)
50. Join
   • Join (inner) any RDD with a C* table
   • No outer joins supported

   rdd.joinWithCassandraTable('keyspace', 'table')
     .on('id')
     .select('col1', 'col2')
51. Join
   • A query per PK in the left RDD for the row(s) in the joined table
   • Somewhat similar to a hash join
   • Usual caveats of skew, 'shuffle' overhead, etc.
   • (repartitionByCassandraReplica not yet supported)
52. Future work
    PySpark Cassandra
53. Future work
   • Main engine:
     – Build on DataFrames / Datasets more?
     – Build on the native Python Cassandra driver?
     – Use marshal or ... instead of pickle?
   • More features:
     – Repartitioning
     – Multi-cluster support
     – Expose the C* session to Python
   • Suggestions?
54. We're hiring
    http://www.target-holding.nl/organisatie/vacatures
    Data scientists, backend and database engineers, web developers, ...
55. Q&A
