Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Munich March 2015 - Cassandra + Spark Overview

596 views

Published on

Munich March 2015 - Cassandra + Spark Overview

Published in: Software
  • Be the first to comment

  • Be the first to like this

Munich March 2015 - Cassandra + Spark Overview

  1. 1. @chbatey Christopher Batey
 Technical Evangelist for Apache Cassandra Cassandra Spark Integration
  2. 2. @chbatey Who am I? What is DataStax? • Technical Evangelist for Apache Cassandra • Cassandra related questions: @chbatey • DataStax - Enterprise ready version of Cassandra - Spark integration - Solr integration - OpsCenter - Majority of Cassandra drivers
  3. 3. @chbatey Agenda • Cassandra overview • Spark overview • Spark Cassandra connector • Cassandra Spark examples
  4. 4. @chbatey Cassandra
  5. 5. @chbatey Cassandra for Applications APACHE CASSANDRA
  6. 6. @chbatey Common use cases •Ordered data such as time series -Event stores -Financial transactions -Sensor data e.g IoT
  7. 7. @chbatey Common use cases •Ordered data such as time series -Event stores -Financial transactions -Sensor data e.g IoT •Non functional requirements: -Linear scalability -High throughout durable writes -Multi datacenter including active-active -Analytics without ETL
  8. 8. @chbatey Cassandra Cassandra • Distributed masterless database (Dynamo) • Column family data model (Google BigTable)
  9. 9. @chbatey Datacenter and rack aware Europe • Distributed master less database (Dynamo) • Column family data model (Google BigTable) • Multi data centre replication built in from the start USA
  10. 10. @chbatey Cassandra Online • Distributed master less database (Dynamo) • Column family data model (Google BigTable) • Multi data centre replication built in from the start • Analytics with Apache SparkAnalytics
  11. 11. @chbatey Scalability & Performance • Scalability - No single point of failure - No special nodes that become the bottle neck - Work/data can be re-distributed • Operational Performance i.e single digit ms - Single node for query - Single disk seek per query
  12. 12. @chbatey Cassandra can not join or aggregate Client Where do I go for the max?
  13. 13. @chbatey But but… • Sometimes you don’t need a answers in milliseconds • Data models done wrong - how do I fix it? • New requirements for old data? • Ad-hoc operational queries • Reports and Analytics - Managers always want counts / maxs
  14. 14. @chbatey Apache Spark • 10x faster on disk,100x faster in memory than Hadoop MR • Works out of the box on EMR • Fault Tolerant Distributed Datasets • Batch, iterative and streaming analysis • In Memory Storage and Disk • Integrates with Most File and Storage Options
  15. 15. @chbatey Components Shark or
 Spark SQL Streaming ML Spark (General execution engine) Graph Cassandra Compatible
  16. 16. @chbatey Spark architecture
  17. 17. @chbatey org.apache.spark.rdd.RDD • Resilient Distributed Dataset (RDD) • Created through transformations on data (map,filter..) or other RDDs • Immutable • Partitioned • Reusable
  18. 18. @chbatey RDD Operations • Transformations - Similar to Scala collections API • Produce new RDDs • filter, flatmap, map, distinct, groupBy, union, zip, reduceByKey, subtract • Actions • Require materialization of the records to generate a value • collect: Array[T], count, fold, reduce..
  19. 19. @chbatey Word count val file: RDD[String] = sc.textFile("hdfs://...")
 val counts: RDD[(String, Int)] = file.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey(_ + _) 
 counts.saveAsTextFile("hdfs://...")
  20. 20. @chbatey Spark shell
  21. 21. @chbatey Spark Streaming
  22. 22. @chbatey Cassandra + Spark
  23. 23. @chbatey Spark Cassandra Connector • Loads data from Cassandra to Spark • Writes data from Spark to Cassandra • Implicit Type Conversions and Object Mapping • Implemented in Scala (offers a Java API) • Open Source • Exposes Cassandra Tables as Spark RDDs + Spark DStreams
  24. 24. @chbatey Analytics Workload Isolation
  25. 25. @chbatey Deployment • Spark worker in each of the Cassandra nodes • Partitions made up of LOCAL cassandra data S C S C S C S C
  26. 26. @chbatey Example Time
  27. 27. @chbatey It is on Github "org.apache.spark" %% "spark-core" % sparkVersion
 
 "org.apache.spark" %% "spark-streaming" % sparkVersion
 
 "org.apache.spark" %% "spark-sql" % sparkVersion
 
 "org.apache.spark" %% "spark-streaming-kafka" % sparkVersion
 
 "com.datastax.spark" % "spark-cassandra-connector_2.10" % connectorVersion
  28. 28. @chbatey Boiler plate import com.datastax.spark.connector.rdd._
 import org.apache.spark._
 import com.datastax.spark.connector._
 import com.datastax.spark.connector.cql._
 
 object BasicCassandraInteraction extends App {
 val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
 val sc = new SparkContext("local[4]", "AppName", conf) // cool stuff
 
 } Cassandra Host Spark master e.g spark://host:port
  29. 29. @chbatey Word Count + Save to Cassandra val textFile: RDD[String] = sc.textFile("Spark-Readme.md")
 
 val words: RDD[String] = textFile.flatMap(line => line.split("s+"))
 val wordAndCount: RDD[(String, Int)] = words.map((_, 1))
 val wordCounts: RDD[(String, Int)] = wordAndCount.reduceByKey(_ + _)
 
 println(wordCounts.first())
 
 wordCounts.saveToCassandra("test", "words", SomeColumns("word", "count"))
  30. 30. @chbatey Denormalised table CREATE TABLE IF NOT EXISTS customer_events( customer_id text, time timestamp, id uuid, event_type text,
 store_name text, store_type text, store_location text, staff_name text, staff_title text, PRIMARY KEY ((customer_id), time, id))
  31. 31. @chbatey Store it twice CREATE TABLE IF NOT EXISTS customer_events( customer_id text, time timestamp, id uuid, event_type text, store_name text, store_type text, store_location text, staff_name text, staff_title text, PRIMARY KEY ((customer_id), time, id)) 
 CREATE TABLE IF NOT EXISTS customer_events_by_staff( customer_id text, time timestamp, id uuid, event_type text, store_name text, store_type text, store_location text, staff_name text, staff_title text, PRIMARY KEY ((staff_name), time, id))
  32. 32. @chbatey My reaction a year ago
  33. 33. @chbatey Too simple val events_by_customer = sc.cassandraTable("test", “customer_events") 
 events_by_customer.saveToCassandra("test", "customer_events_by_staff", SomeColumns("customer_id", "time", "id", "event_type", "staff_name", "staff_title", "store_location", "store_name", "store_type"))

  34. 34. @chbatey Aggregations with Spark SQL Partition Key Clustering Columns
  35. 35. @chbatey Now now… val cc = new CassandraSQLContext(sc)
 cc.setKeyspace("test")
 val rdd: SchemaRDD = cc.sql("SELECT store_name, event_type, count(store_name) from customer_events GROUP BY store_name, event_type")
 rdd.collect().foreach(println) [SportsApp,WATCH_STREAM,1] [SportsApp,LOGOUT,1] [SportsApp,LOGIN,1] [ChrisBatey.com,WATCH_MOVIE,1] [ChrisBatey.com,LOGOUT,1] [ChrisBatey.com,BUY_MOVIE,1] [SportsApp,WATCH_MOVIE,2]
  36. 36. @chbatey Lamda architecture http://lambda-architecture.net/
  37. 37. @chbatey Network word count CassandraConnector(conf).withSessionDo { session =>
 session.execute("CREATE TABLE IF NOT EXISTS test.network_word_count(word text PRIMARY KEY, number int)")
 session.execute("CREATE TABLE IF NOT EXISTS test.network_word_count_raw(time timeuuid PRIMARY KEY, raw text)")
 } 
 val ssc = new StreamingContext(conf, Seconds(5))
 val lines = ssc.socketTextStream("localhost", 9999)
 lines.map((UUIDs.timeBased(), _)).saveToCassandra("test", "network_word_count_raw")
 
 val words = lines.flatMap(_.split("s+"))
 val countOfOne = words.map((_, 1))
 val reduced = countOfOne.reduceByKey(_ + _)
 reduced.saveToCassandra("test", "network_word_count")
  38. 38. @chbatey Summary • Cassandra is an operational database • Spark gives us the flexibility to do slower things - Schema migrations - Ad-hoc queries - Report generation • Spark streaming + Cassandra allow us to build online analytical platforms
  39. 39. @chbatey Thanks for listening • Follow me on twitter @chbatey • Cassandra + Fault tolerance posts a plenty: • http://christopher-batey.blogspot.co.uk/ • Github for all examples: • https://github.com/chbatey/spark-sandbox • Cassandra resources: http://planetcassandra.org/ • In London in April? http://www.eventbrite.com/e/cassandra- day-london-2015-april-22nd-2015-tickets-15053026006? aff=CommunityLanding

×