
Cassandra and Spark

Analyzing time series data with Apache Spark and Cassandra


  1. Apache Spark and Cassandra
 Patrick McFadin, Chief Evangelist for Apache Cassandra, DataStax (@PatrickMcFadin)
  2. About me • Chief Evangelist for Apache Cassandra • Senior Solution Architect at DataStax • Chief Architect, Hobsons • Web applications and performance since 1996
  3. Apache Cassandra
  4. Cassandra is… • Shared nothing • Masterless peer-to-peer • Great scaling story • Resilient to failure
  5. Cassandra for Applications APACHE CASSANDRA
  6. Example: Weather Station • Weather station collects data • Cassandra stores in sequence • Application reads in sequence
  7. Queries supported
 CREATE TABLE raw_weather_data (
 wsid text,
 year int,
 month int,
 day int,
 hour int,
 temperature double,
 dewpoint double,
 pressure double,
 wind_direction int,
 wind_speed double,
 sky_condition int,
 sky_condition_text text,
 one_hour_precip double,
 six_hour_precip double,
 PRIMARY KEY ((wsid), year, month, day, hour)
 ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

 Get weather data given: • Weather Station ID • Weather Station ID and Time • Weather Station ID and Range of Time
  8. Aggregation Queries
 CREATE TABLE daily_aggregate_temperature (
 wsid text,
 year int,
 month int,
 day int,
 high double,
 low double,
 mean double,
 variance double,
 stdev double,
 PRIMARY KEY ((wsid), year, month, day)
 ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);

 Get temperature stats given: • Weather Station ID • Weather Station ID and Time • Weather Station ID and Range of Time
 Example: Windsor, California, July 1, 2014: High: 73.4, Low: 51.4
  9. ?
  10. Apache Spark
  11. Map Reduce: Input Data → Map → Intermediate Data → Reduce → Output Data, with every stage passing through Disk
  12. Data Science at Scale, 2009
  13. In memory: the same Input Data → Map → Intermediate Data → Reduce → Output Data flow, still bound to Disk
  14. In memory: Input Data → Spark → Intermediate Data → Output Data, with Memory in place of Disk
  15. Resilient Distributed Dataset
  16. RDDs are: • Immutable • Partitioned • Reusable
 They have Transformations, which produce a new RDD: filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract
 They have Actions, which start cluster computing operations: collect: Array[T], count, fold, reduce, ...
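 A minimal Scala sketch of that split (plain Spark, nothing connector-specific; the names and numbers here are illustrative): transformations only describe new RDDs, and no work happens until an action runs.

 import org.apache.spark.{SparkConf, SparkContext}

 val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("rdd-basics"))
 val numbers = sc.parallelize(1 to 100)    // RDD[Int]
 val evens   = numbers.filter(_ % 2 == 0)  // transformation: lazy, nothing executed yet
 val doubled = evens.map(_ * 2)            // transformation: still lazy
 val total   = doubled.count()             // action: triggers the actual computation
 println(s"count = $total")                // count = 50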
  17. API: filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, map, reduce, ...
  18. Spark Streaming (near real-time), Spark SQL (structured data), MLlib (machine learning), GraphX (graph analysis)
  19. Cassandra and Spark
  20. Great combo: Store a ton of data, Analyze a ton of data
  21. Great combo: Spark Streaming (near real-time), Spark SQL (structured data), MLlib (machine learning), GraphX (graph analysis)
  22. Great combo: all of the above, wired to the raw_weather_data table (schema as on slide 7) through the Spark Connector
  23. Spark anatomy: a Master, and on each Server a Worker managing Executors
  24. Master and four Workers dividing token ranges 0-100: 0-24, 25-49, 50-74, 75-99. "I will only analyze 25% of the data."
  25. Transactional and Analytics rings side by side, each covering ranges 0-24, 25-49, 50-74, 75-99
  26. Executors on the Worker owning range 75-99; the Spark Connector builds a Spark RDD of Spark Partitions from: SELECT * FROM keyspace.table WHERE token(pk) > 75 AND token(pk) <= 99
  27. Executors on the Worker owning range 75-99 → Spark RDD of Spark Partitions
  28. Spark Connector
     Capability                  Cassandra    Cassandra + Spark
     Joins and Unions            No           Yes
     Transformations             Limited      Yes
     Outside Data Integration    No           Yes
     Aggregations                Limited      Yes
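 To make the "Joins and Unions" row concrete, here is a hedged sketch (assuming the sc configured as on slide 70, and the weather_station id/name columns that appear in the join on slide 76): two Cassandra tables keyed and joined in Spark, something CQL alone cannot express.

 import com.datastax.spark.connector._

 val raw = sc.cassandraTable("isd_weather_data", "raw_weather_data")
   .map(row => (row.getString("wsid"), row.getDouble("temperature")))
 val stations = sc.cassandraTable("isd_weather_data", "weather_station")
   .map(row => (row.getString("id"), row.getString("name")))

 // Standard RDD join on the station id
 val joined = raw.join(stations)   // RDD[(String, (Double, String))]
 joined.take(3).foreach { case (wsid, (temp, name)) => println(s"$name: $temp") }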
  29. Spark Reads on Cassandra (awesome animation by DataStax’s own Russell Spitzer)
  30. Spark RDDs Represent a Large Amount of Data Partitioned into Chunks: chunks 1-9 of one RDD spread across Node 1 through Node 4 (slides 31-32 animate the chunks shuffling among the nodes)
  33. Cassandra Data is Distributed By Token Range (slides 34-38 build up the picture: a ring of tokens 0-999, marked at 0 and 500, then split across Node 1 through Node 4, shown first without vnodes and then with vnodes)
  39. The Connector Uses Information on the Node to Make Spark Partitions: with spark.cassandra.input.split.size 50 and a reported density of 0.5, Node 1's token ranges (0-50, 120-220, 300-500, 780-830) are grouped and, where too large, split (300-500 becomes 300-400 and 400-500) until they form Spark partitions 1-4 of roughly split-size rows (slides 40-51 animate this step by step)
  52. Data is Retrieved Using the DataStax Java Driver: with spark.cassandra.input.page.row.size 50, the executor reads each Spark partition with token-range queries such as
 SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830
 SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50
 and the driver pages the results back 50 CQL rows at a time until each range is exhausted (slides 53-69 animate the paging)
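 A hedged configuration sketch tying the last two sections together: both property names come straight from these slides (they belong to connector 1.x; later connector versions renamed them), and 50 is the slides' example value, not a recommended setting.

 import org.apache.spark.{SparkConf, SparkContext}

 val conf = new SparkConf(true)
   .setMaster("local[*]")
   .setAppName("tuned-reads")
   .set("spark.cassandra.connection.host", "127.0.0.1")
   .set("spark.cassandra.input.split.size", "50")      // target CQL rows per Spark partition
   .set("spark.cassandra.input.page.row.size", "50")   // rows fetched per driver round trip
 val sc = new SparkContext(conf)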
  70. Attaching to Spark and Cassandra
 // Import Cassandra-specific functions on SparkContext and RDD objects
 import org.apache.spark.{SparkContext, SparkConf}
 import com.datastax.spark.connector._

 /** The setMaster("local[*]") lets us run & test the job right in our IDE */
 val conf = new SparkConf(true)
   .set("spark.cassandra.connection.host", "127.0.0.1")
   .setMaster("local[*]")
   .setAppName(getClass.getName)
   // Optionally, authentication (the connector's property names are spark.cassandra.auth.*)
   .set("spark.cassandra.auth.username", "cassandra")
   .set("spark.cassandra.auth.password", "cassandra")

 val sc = new SparkContext(conf)
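 If you would rather experiment in the REPL than an IDE, the connector can also be pulled into spark-shell with --packages; the artifact coordinates below are an assumption (pick the build matching your Spark and Scala versions):

 spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0 \
   --conf spark.cassandra.connection.host=127.0.0.1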
  71. Weather station example
 CREATE TABLE raw_weather_data (
 wsid text, 
 year int, 
 month int, 
 day int, 
 hour int, 
 temperature double, 
 dewpoint double, 
 pressure double, 
 wind_direction int, 
 wind_speed double, 
 sky_condition int, 
 sky_condition_text text, 
 one_hour_precip double, 
 six_hour_precip double, 
 PRIMARY KEY ((wsid), year, month, day, hour)
 ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
  72. Simple example
 /** keyspace & table */
 val tableRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")
 
 
 /** get a simple count of all the rows in the raw_weather_data table */
 val rowCount = tableRDD.count()
 
 
 println(s"Total Rows in Raw Weather Table: $rowCount")
 sc.stop()
  73. Simple example
 /** keyspace & table */
 val tableRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")
 
 
 /** get a simple count of all the rows in the raw_weather_data table */
 val rowCount = tableRDD.count()
 
 
 println(s"Total Rows in Raw Weather Table: $rowCount")
 sc.stop()

 Executor: SELECT * FROM isd_weather_data.raw_weather_data → Spark RDD (Spark Partitions) via the Spark Connector
  74. Using CQL
 SELECT temperature
 FROM raw_weather_data
 WHERE wsid = '724940:23234'
 AND year = 2008
 AND month = 12
 AND day = 1;

 // The same query through the connector; year, month and day are int
 // columns, so they are bound as Ints rather than Strings
 val cqlRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")
   .select("temperature")
   .where("wsid = ? AND year = ? AND month = ? AND day = ?",
     "724940:23234", 2008, 12, 1)
  75. Using SQL!
 spark-sql> SELECT wsid, year, month, day, max(temperature) high, min(temperature) low
 FROM raw_weather_data
 WHERE month = 6
 AND temperature != 0.0
 GROUP BY wsid, year, month, day;

 724940:23234 2008 6 1 15.6 10.0
 724940:23234 2008 6 2 15.6 10.0
 724940:23234 2008 6 3 17.2 11.7
 724940:23234 2008 6 4 17.2 10.0
 724940:23234 2008 6 5 17.8 10.0
 724940:23234 2008 6 6 17.2 10.0
 724940:23234 2008 6 7 20.6 8.9
  76. SQL with a Join
 spark-sql> SELECT ws.name, raw.hour, raw.temperature
 FROM raw_weather_data raw
 JOIN weather_station ws
 ON raw.wsid = ws.id
 WHERE raw.wsid = '724940:23234'
 AND raw.year = 2008 AND raw.month = 6 AND raw.day = 1;

 SAN FRANCISCO INTL AP 23 15.0
 SAN FRANCISCO INTL AP 22 15.0
 SAN FRANCISCO INTL AP 21 15.6
 SAN FRANCISCO INTL AP 20 15.0
 SAN FRANCISCO INTL AP 19 15.0
 SAN FRANCISCO INTL AP 18 14.4
  77. Python
 from pyspark_cassandra import CassandraSparkContext
 from pyspark.sql import SQLContext
 from pyspark import SparkContext, SparkConf

 conf = SparkConf() \
     .setAppName("PySpark Cassandra Test") \
     .setMaster("spark://127.0.0.1:7077") \
     .set("spark.cassandra.connection.host", "127.0.0.1")

 sc = CassandraSparkContext(conf=conf)
 sql = SQLContext(sc)

 rows = sql.sql('''SELECT max(temperature) AS high, wsid, year, month, day
 FROM raw_weather_data
 WHERE wsid = '724940:23234'
 AND month = 6
 AND day = 1
 AND temperature != 0.0
 GROUP BY wsid, year, month, day;''').collect()

 # Python 2 print statement, matching the era of pyspark_cassandra
 for row in rows:
     print row.wsid, row.day, row.high
  78. Analyzing large data sets
 // Keep rows as CassandraRow (not Double) so spanBy can read wsid;
 // wsid is added to the projection for the same reason
 val spanRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")
   .select("wsid", "temperature")
   .where("wsid = ? AND year = ? AND month = ? AND day = ?",
     "724940:23234", 2008, 12, 1)
   .spanBy(row => row.getString("wsid"))

 • Specify partition grouping • Use with large partitions • Perfect for time series
  79. Weather Station Analysis • Weather station collects data • Cassandra stores in sequence • Spark rolls up data into new tables. Example: Windsor, California, July 1, 2014: High: 73.4, Low: 51.4
  80. Saving back the weather data
 val cc = new CassandraSQLContext(sc)
 cc.setKeyspace("isd_weather_data")
 cc.sql("""
 SELECT wsid, year, month, day, max(temperature) high, min(temperature) low
 FROM raw_weather_data
 WHERE month = 6
 AND temperature != 0.0
 GROUP BY wsid, year, month, day;
 """)
 .map { row => (row.getString(0), row.getInt(1), row.getInt(2), row.getInt(3), row.getDouble(4), row.getDouble(5)) }
 // Name the six target columns explicitly: daily_aggregate_temperature also
 // has mean, variance and stdev, which this job does not fill in
 .saveToCassandra("isd_weather_data", "daily_aggregate_temperature",
   SomeColumns("wsid", "year", "month", "day", "high", "low"))
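 An alternative sketch under the same schema assumptions: mapping to a case class lets saveToCassandra match fields to columns by name instead of by tuple position.

 // Field names match the target columns, so the connector maps them by name
 case class DailyTemperature(wsid: String, year: Int, month: Int, day: Int,
                             high: Double, low: Double)

 cc.sql("""
 SELECT wsid, year, month, day, max(temperature) high, min(temperature) low
 FROM raw_weather_data
 WHERE month = 6 AND temperature != 0.0
 GROUP BY wsid, year, month, day;
 """)
   .map(row => DailyTemperature(row.getString(0), row.getInt(1), row.getInt(2),
     row.getInt(3), row.getDouble(4), row.getDouble(5)))
   .saveToCassandra("isd_weather_data", "daily_aggregate_temperature")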
  81. What just happened • Data is read from the temperature table • Transformed • Inserted into the daily_high_low table (Table: temperature → read data from table → transform → insert data into table → Table: daily_high_low)
  82. Spark Streaming
  83. The problem domain: petabytes of data, gigabytes per second
  84. Analytics and search pipelines fed by Spark Streaming, with sources such as Kinesis and S3
  85. DStream: Micro Batches. A DStream is a continuous sequence of μBatches (ordinary RDDs); processing a DStream = processing its μBatches as RDDs. • More complex processing models are possible with less effort • Streaming computations run as a series of deterministic batch computations on small time intervals
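 A hedged sketch of the DStream model with the Cassandra connector: the streaming saveToCassandra comes from the connector's streaming package, while the socket source, the "wsid,temperature" line format, and the streaming_temperature table are illustrative assumptions, not from the slides.

 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import com.datastax.spark.connector._                 // SomeColumns
 import com.datastax.spark.connector.streaming._       // saveToCassandra on DStreams

 val ssc = new StreamingContext(sc, Seconds(5))        // one μBatch every 5 seconds

 // Illustrative source: lines of "wsid,temperature" arriving on a socket,
 // written to a hypothetical streaming_temperature (wsid text, temperature double) table
 val updates = ssc.socketTextStream("127.0.0.1", 9999)
   .map(_.split(','))
   .map(cols => (cols(0), cols(1).toDouble))

 updates.saveToCassandra("isd_weather_data", "streaming_temperature",
   SomeColumns("wsid", "temperature"))

 ssc.start()
 ssc.awaitTermination()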
  86. Now what? A Cassandra-only DC alongside a Cassandra + Spark DC running Spark Jobs and Spark Streaming
  87. You can do this at home! https://github.com/killrweather/killrweather
  88. Discount codes: PatrickM50 (50% off Priority Pass), PatrickMCert (25% off Certification)
  89. Thank you! Bring the questions. Follow me on Twitter: @PatrickMcFadin
