Analytics with Spark and Cassandra

Apache Cassandra is a leading open-source distributed database capable of amazing feats of scale, but its data model requires a bit of planning for it to perform well. Of course, the nature of ad-hoc data exploration and analysis requires that we be able to ask questions we hadn’t planned on asking—and get an answer fast. Enter Apache Spark.

Spark is a distributed computation framework optimized to work in-memory, and heavily influenced by concepts from functional programming languages. It’s exactly what a Cassandra cluster needs to deliver real-time, ad-hoc querying of operational data at scale.

In this talk, we’ll explore Spark and see how it works together with Cassandra to deliver a powerful open-source big data analytic solution.

  1. Spark and Cassandra (@tlberglund)
  2. Spark and Cassandra (@tlberglund)
  3. Cassandra
  4. When? • You have too much data • It changes a lot • The system is always on
  5. What? • Distributed database • Transactional/online • Low latency, high velocity
  6. Data Model: CREATE TABLE credit_card_transactions( card_number VARCHAR, transaction_time TIMESTAMP, amount DECIMAL, merchant_id VARCHAR, country VARCHAR, PRIMARY KEY(card_number, transaction_time)) WITH CLUSTERING ORDER BY (transaction_time DESC);
  7.–8. Partition Key: the same table, with card_number highlighted as the partition key
  9.–10. Clustering Column: the same table, with transaction_time highlighted as the clustering column
  11. Primary Key: the same table, with the full primary key (card_number, transaction_time) highlighted
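
For the Spark connector code later in the deck, a schema like this might be mirrored by a Scala case class; a hedged sketch (the class name and field types are assumptions, though the connector does map snake_case columns onto camelCase fields):

      import java.util.Date

      // mirrors credit_card_transactions; the Spark Cassandra connector can
      // materialize rows into a case class whose fields match the column names
      case class CreditCardTransaction(
        cardNumber: String,      // card_number (partition key)
        transactionTime: Date,   // transaction_time (clustering column)
        amount: BigDecimal,      // amount
        merchantId: String,      // merchant_id
        country: String          // country
      )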
  12. Data Model:
      card_number  xctn_time  amount  country  merchant
      5444…8389    13:58      $12.42  US       101
      4532…1191    13:59      $50.38  US       101
      5444…6027    13:59      $3.02   US       102
      4532…2189    14:02      $12.42  US       198
      5444…6027    14:03      €23.78  ES       352
      5444…8389    14:03      $98.03  US       158
  13. Partition Key: the same table, with the card_number column highlighted
  14. Clustering Column: the same table, with the xctn_time column highlighted
  15. Primary Key: the same table, with (card_number, xctn_time) highlighted
  16.–19. Partition: the same table, with rows grouped into partitions by card_number
  20.–22. Partitioning (diagram): a token ring with positions 0000, 2000, 4000, 6000, 8000, A000, C000, E000; the row (5444…8389, 14:03, $98.03, US, 158) is hashed, f(5444…8389) = 5D93, and the node owning that token range stores the partition
  23. Partitioning: the same transactions table, with its rows grouped into a partition per card_number
  24. Partitions: • Clumps of related data • Always on one server • Accessed by consistent hashing
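
The consistent-hashing idea on slides 20–24 fits in a few lines of Scala. This is a hedged illustration, not Cassandra's actual partitioner: the hash function, token values, and ownership rule are simplified stand-ins.

      import scala.util.hashing.MurmurHash3

      object TokenRing {
        // node tokens around the ring, echoing the slide's 0000, 2000, ..., E000
        val tokens: Vector[Int] =
          Vector(0x0000, 0x2000, 0x4000, 0x6000, 0x8000, 0xA000, 0xC000, 0xE000)

        // f(x): map a partition key to a position on the ring
        // (an illustrative stand-in for Cassandra's real partitioner)
        def position(partitionKey: String): Int =
          (MurmurHash3.stringHash(partitionKey) & 0x7fffffff) % 0x10000

        // the owner is the first node token at or after the key's position,
        // wrapping around the ring
        def ownerToken(partitionKey: String): Int = {
          val p = position(partitionKey)
          tokens.find(_ >= p).getOrElse(tokens.head)
        }
      }

Because the same key always hashes to the same token, a partition lives in one place and can be located without any lookup table.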
  25. Consequences: • Design for low-latency, high-throughput reads and writes • Inflexible ad-hoc queries • Very little aggregation • Very little secondary indexing
  26. Spark
  27. When do I use it? • You have too much data • You don’t know the questions • You don’t like MapReduce • You’re tired of Hadoop lol
  28. What are they saying? • Distributed, functional • “In-memory” • “Real-time” • Next-generation MapReduce
  29. Paradigm: Store collections in Resilient Distributed Datasets, or “RDDs”
  30. Paradigm: Transform RDDs into other RDDs
  31. Paradigm: Aggregate RDDs with “actions”
  32. Paradigm: Keep track of a graph of transformations and actions
  33. Paradigm: Distribute all of this storage and computation
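
That whole paradigm fits in a short sketch (names and data invented for illustration; a `local[*]` master so it runs standalone): the transformations only record lineage, and nothing executes until the action at the end.

      import org.apache.spark.{SparkConf, SparkContext}

      object RddParadigm {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(
            new SparkConf().setAppName("paradigm").setMaster("local[*]"))

          val nums    = sc.parallelize(1 to 1000)  // a collection stored as an RDD
          val doubled = nums.map(_ * 2)            // transformation: recorded in the graph, not run
          val big     = doubled.filter(_ > 100)    // another transformation, still lazy

          println(big.count())                     // action: now the whole graph executes
          sc.stop()
        }
      }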
  34. Architecture (diagram): the Spark Client runs the Driver, which holds the Spark Context, alongside the cluster's token ring
  35. Architecture (diagram): the Spark Context builds a Job
  36. Architecture (diagram): the Job is handed to the Cluster Manager
  37. Architecture (diagram): the Cluster Manager breaks the Job into Tasks
  38. Architecture (diagram): the Tasks are shipped out across the cluster
  39. Worker Node (diagram): an Executor runs Tasks and keeps a Cache
  40. Worker Node (diagram): Executors run Tasks and hold cached RDD partitions
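
Those executor caches are what `persist` talks to. A minimal sketch, assuming `sc` is a live SparkContext as in the later slides (the file path is hypothetical):

      import org.apache.spark.storage.StorageLevel

      val lines = sc.textFile("/some/large/input.txt")  // hypothetical input
      lines.persist(StorageLevel.MEMORY_ONLY)           // keep partitions in executor memory

      println(lines.count())  // first action computes and caches the partitions
      println(lines.count())  // second action is served from the executors' caches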
  41. What’s an RDD?
  42. What’s an RDD? This is
  43. What’s an RDD? Bigger than a computer
  44. What’s an RDD? Read from an input source?
  45. What’s an RDD? The output of a pure function?
  46. What’s an RDD? Immutable
  47. What’s an RDD? Typed
  48. What’s an RDD? Ordered
  49. What’s an RDD? Functions comprise a graph
  50. What’s an RDD? Lazily evaluated
  51. What’s an RDD? A collection of things
  52. Distributed, how?
  53. Distributed, how? Hash these: 5444 5676 0686 8389: List(xn-1, xn-2, xn-3); 4532 4569 7030 1191: List(xn-4, xn-5, xn-6); 5444 5607 6517 6027: List(xn-7, xn-8, xn-9); 4532 4577 0122 2189:
  54. Partitioning the RDD by key:
      val conf = new SparkConf(true)
        .set("spark.cassandra.connection.host", "10.0.3.74")
      val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
      val transactions = sc. // cassandra stuff
        .partitionBy(new HashPartitioner(25)).persist()
  55. Cassandra Connector
  56. From a Text File:
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.SparkContext._
      import com.datastax.spark.connector._

      object ScalaProgram {
        def main(args: Array[String]) {
          val conf = new SparkConf(true)
            .set("spark.cassandra.connection.host", "127.0.0.1")
          val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
          val rdd = sc.textFile("/actual/path/raven.txt")
        }
      }
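
As a quick, hypothetical use of that RDD (not from the deck), the classic word count shows the transform-then-aggregate flow:

      // assuming `rdd` is the RDD[String] of lines read above
      val wordCounts = rdd
        .flatMap(_.split("\\s+"))  // lines -> words
        .map(word => (word, 1))    // words -> k/v pairs
        .reduceByKey(_ + _)        // sum the counts per word
      wordCounts.take(10).foreach(println)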
  57. Data Model: CREATE TABLE credit_card_transactions( card_number VARCHAR, transaction_time TIMESTAMP, amount DECIMAL, merchant_id VARCHAR, country VARCHAR, PRIMARY KEY(card_number, transaction_time)) WITH CLUSTERING ORDER BY (transaction_time DESC);
  58. From a Table:
      val conf = new SparkConf(true)
        .set("spark.cassandra.connection.host", "127.0.0.1")
      val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
      val xctns = sc.cassandraTable("payments", "credit_card_transactions")
  59. Projection:
      val conf = new SparkConf(true)
        .set("spark.cassandra.connection.host", "127.0.0.1")
      val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
      val xctns = sc.cassandraTable("payments", "credit_card_transactions")
        .select("card_number", "country")
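
Besides projecting columns with `select`, the connector can push row filtering down to Cassandra with `where`; a sketch (the predicate and value are illustrative, and the filtered column must be one Cassandra can serve efficiently, such as the clustering column here):

      val recent = sc.cassandraTable("payments", "credit_card_transactions")
        .select("card_number", "transaction_time", "amount")
        .where("transaction_time > ?", "2015-01-01 00:00:00")  // evaluated by Cassandra, not Spark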
  60. Making k/v pairs:
      val conf = new SparkConf(true)
        .set("spark.cassandra.connection.host", "127.0.0.1")
      val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
      val countries = xctns.map(x =>
        (x.getString("card_number"), x.getString("country")))
  61. Count up Countries:
      val conf = new SparkConf(true)
        .set("spark.cassandra.connection.host", "127.0.0.1")
      val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
      val cCount = xctns.reduce((card, country) =>
        // cool analysis redacted!
      )
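
One way the redacted analysis might look (a sketch, not the speaker's code), using the `countries` pairs from slide 60 and counting transactions per country:

      // assuming `countries` is an RDD[(String, String)] of (card_number, country)
      val perCountry = countries
        .map { case (_, country) => (country, 1) }
        .reduceByKey(_ + _)  // for the sample data: ("US", 5), ("ES", 1)
      perCountry.collect().foreach(println)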
  62. Make k/v pairs:
      val conf = new SparkConf(true)
        .set("spark.cassandra.connection.host", "127.0.0.1")
      val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
      val countries = xctns.map(x => (x._1, x._2))
  63. Persisting to a Table:
      val conf = new SparkConf(true)
        .set("spark.cassandra.connection.host", "127.0.0.1")
      val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
      cCount.saveToCassandra("payments", "frequent_countries")
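
Putting the slides together, a minimal end-to-end sketch under the same assumptions (the frequent_countries column names are invented):

      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.SparkContext._
      import com.datastax.spark.connector._

      object FrequentCountries {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf(true)
            .set("spark.cassandra.connection.host", "127.0.0.1")
          val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)

          sc.cassandraTable("payments", "credit_card_transactions")
            .select("card_number", "country")                // read and project
            .map(row => (row.getString("country"), 1))       // k/v pairs
            .reduceByKey(_ + _)                              // count per country
            .saveToCassandra("payments", "frequent_countries",
              SomeColumns("country", "transaction_count"))   // target columns assumed

          sc.stop()
        }
      }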
  64. Thank you! @tlberglund
