Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with Apache: Spark, Kafka, Cassandra and Akka


Published on

Streaming Big Data: Delivering Meaning In Near-Real Time At High Velocity At Massive Scale with Apache Spark, Apache Kafka, Apache Cassandra, Akka and the Spark Cassandra Connector. Why this pairing of technologies and How easy it is to implement. Example application:

Published in: Technology

Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with Apache: Spark, Kafka, Cassandra and Akka

  1. 1. Streaming Big Data with Apache Spark, Kafka and Cassandra Helena Edelson @helenaedelson ©2014 DataStax Confidential. Do not distribute without consent. 1
  2. 2. 1 Delivering Meaning In Near-Real Time At High Velocity 2 Overview Of Spark Streaming, Cassandra, Kafka and Akka 3 The Spark Cassandra Connector 4 Integration In Big Data Applications © 2014 DataStax, All Rights Reserved. 2
  3. 3. Who Is This Person? • Using Scala in production since 2010! • Spark Cassandra Connector Committer! • Akka (Cluster) Contributor ! • Scala Driver for Cassandra Committer ! • @helenaedelson! • ! • Senior Software Engineer, Analytics Team, DataStax! • Spent the last few years as a Senior Cloud Engineer Analytic Analytic
  4. 4. Analytic Analytic Search Spark WordCount
  5. 5. Delivering Meaning In Near-Real Time At High Velocity At Massive Scale ©2014 DataStax Confidential. Do not distribute without consent. 5
  6. 6. Strategies • Partition For Scale! • Replicate For Resiliency! • Share Nothing! • Asynchronous Message Passing! • Parallelism! • Isolation! • Location Transparency
  7. 7. What We Need • Fault Tolerant ! • Failure Detection! • Fast - low latency, distributed, data locality! • Masterless, Decentralized Cluster Membership! • Span Racks and DataCenters! • Hashes The Node Ring ! • Partition-Aware! • Elasticity ! • Asynchronous - message-passing system! • Parallelism! • Network Topology Aware!
  8. 8. Why Akka Rocks • location transparency! • fault tolerant! • async message passing! • non-deterministic! • share nothing! • actor atomicity (w/in actor)! !
  9. 9. Apache Kafka! From Joe Stein
  10. 10. Lambda Architecture Spark SQL structured Spark Core Cassandra Spark Streaming real-time MLlib machine learning GraphX graph Apache ! Kafka
  11. 11. Apache Spark and Spark Streaming ©2014 DataStax Confidential. Do not distribute without consent. 12 The Drive-By Version
  12. 12. Analytic Analytic Search What Is Apache Spark • Fast, general cluster compute system! • Originally developed in 2009 in UC Berkeley’s AMPLab! • Fully open sourced in 2010 – now at Apache Software Foundation! • Distributed, Scalable, Fault Tolerant
  13. 13. Apache Spark - Easy to Use & Fast • 10x faster on disk,100x faster in memory than Hadoop MR! • Works out of the box on EMR! • Fault Tolerant Distributed Datasets! • Batch, iterative and streaming analysis! • In Memory Storage and Disk ! • Integrates with Most File and Storage Options Up to 100× faster (2-10× on disk) Analytic 2-5× less code Analytic Search
  14. 14. Spark Components Spark SQL structured Spark Core Spark Streaming real-time MLlib machine learning GraphX graph
  15. 15. • Functional • On the JVM • Capture functions and ship them across the network • Static typing - easier to control performance • Leverage Scala REPL for the Spark REPL html Analytic Analytic Search Why Scala?
  16. 16. Spark Versus Spark Streaming zillions of bytes gigabytes per second
  17. 17. Input & Output Sources
  18. 18. Analytic Analytic Search Spark Streaming Kinesis,'S3'
  19. 19. Common Use Cases applications sensors web mobile phones intrusion detection malfunction detection site analytics network metrics analysis fraud detection dynamic process optimisation recommendations location based ads log processing supply chain planning sentiment analysis …
  20. 20. DStream - Micro Batches • Continuous sequence of micro batches! • More complex processing models are possible with less effort! • Streaming computations as a series of deterministic batch computations on small time intervals DStream μBatch (ordinary RDD) μBatch (ordinary RDD) μBatch (ordinary RDD) Processing of DStream = Processing of μBatches, RDDs
  21. 21. DStreams representing the stream of raw data received from streaming sources. ! !•!Basic sources: Sources directly available in the StreamingContext API! !•!Advanced sources: Sources available through external modules available in Spark as artifacts 22 InputDStreams
  22. 22. Spark Streaming Modules GroupId ArtifactId Latest Version org.apache.spark spark-streaming-kinesis-asl_2.10 1.1.0 org.apache.spark spark-streaming-mqtt_2.10 1.1.0 all (7) org.apache.spark spark-streaming-zeromq_2.10 1.1.0 all (7) org.apache.spark spark-streaming-flume_2.10 1.1.0 all (7) org.apache.spark spark-streaming-flume-sink_2.10 1.1.0 org.apache.spark spark-streaming-kafka_2.10 1.1.0 all (7) org.apache.spark spark-streaming-twitter_2.10 1.1.0 all (7)
  23. 23. Streaming Window Operations // where pairs are (word,count) pairsStream .flatMap { case (k,v) => (k,v.value) } .reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(30), Seconds(10)) .saveToCassandra(keyspace,table) 24
  24. 24. Streaming Window Operations • window(Duration, Duration) • countByWindow(Duration, Duration) • reduceByWindow(Duration, Duration) • countByValueAndWindow(Duration, Duration) • groupByKeyAndWindow(Duration, Duration) • reduceByKeyAndWindow((V, V) => V, Duration, Duration)
  25. 25. Ten Things About Cassandra You always wanted to know but were afraid to ask ©2014 DataStax Confidential. Do not distribute without consent. 26
  26. 26. Quick intro to Cassandra • Shared nothing • Masterless peer-to-peer • Based on Dynamo • Highly Scalable • Spans DataCenters
  27. 27. Apache Cassandra • Elasticity - scale to as many nodes as you need, when you need! •! Always On - No single point of failure, Continuous availability!! !•! Masterless peer to peer architecture! •! Designed for Replication! !•! Flexible Data Storage! !•! Read and write to any node syncs across the cluster! !•! Operational simplicity - with all nodes in a cluster being the same, there is no complex configuration to manage
  28. 28. Easy to use • CQL - familiar syntax! • Friendly to programmers! • Paxos for locking CREATE TABLE users (! username varchar,! firstname varchar,! lastname varchar,! email list<varchar>,! password varchar,! created_date timestamp,! PRIMARY KEY (username)! ); INSERT INTO users (username, firstname, lastname, ! email, password, created_date)! VALUES ('hedelson','Helena','Edelson',! [‘'],'ba27e03fd95e507daf2937c937d499ab','2014-11-15 13:50:00');! INSERT INTO users (username, firstname, ! lastname, email, password, created_date)! VALUES ('pmcfadin','Patrick','McFadin',! [''],! 'ba27e03fd95e507daf2937c937d499ab',! '2011-06-20 13:50:00')! IF NOT EXISTS;
  29. 29. Spark Cassandra Connector ©2014 DataStax Confidential. Do not distribute without consent. 30
  30. 30. Spark On Cassandra • Server-Side filters (where clauses)! • Cross-table operations (JOIN, UNION, etc.)! • Data locality-aware (speed)! • Data transformation, aggregation, etc. ! • Natural Time Series Integration
  31. 31. Spark Cassandra Connector • Loads data from Cassandra to Spark! • Writes data from Spark to Cassandra! • Implicit Type Conversions and Object Mapping! • Implemented in Scala (offers a Java API)! • Open Source ! • Exposes Cassandra Tables as Spark RDDs + Spark DStreams
  32. 32. Spark Cassandra Connector ! C* User Application Spark-Cassandra Connector Cassandra C* C* C* Spark Executor C* Java (Soon Scala) Driver
  33. 33. Use Cases: KillrWeather • Get data by weather station! • Get data for a single date and time! • Get data for a range of dates and times! • Compute, store and quickly retrieve daily, monthly and annual aggregations of data! Data Model to support queries • Store raw data per weather station • Store time series in order: most recent to oldest • Compute and store aggregate data in the stream • Set TTLs on historic data
  34. 34. Cassandra Data Model user_name tweet_id publication_date message pkolaczk 4b514a41-­‐9cf... 2014-­‐08-­‐21 10:00:04 ... pkolaczk 411bbe34-­‐11d... 2014-­‐08-­‐24 16:03:47 ... ewa 73847f13-­‐771... 2014-­‐08-­‐21 10:00:04 ... ewa 41106e9a-­‐5ff... 2014-­‐08-­‐21 10:14:51 ... row partitioning key partition primary key data columns
  35. 35. Spark Data Model A1 A2 A3 A4 A5 A6 A7 A8 map B1 B2 B3 B4 B5 B6 B7 B8 filter B1 B2 B5 B7 B8 C reduce Resilient Distributed Dataset! ! A Scala collection:! • immutable! • iterable! • serializable! • distributed! • parallel! • lazy evaluation
  36. 36. Spark Cassandra Example val conf = new SparkConf(loadDefaults = true) .set("", "") .setMaster("spark://") ! val sc = new SparkContext(conf) ! ! val table: CassandraRDD[CassandraRow] = sc.cassandraTable("keyspace", "tweets") ! val ssc = new StreamingContext(sc, Seconds(30)) val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( ssc, kafka.kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY)"demo", "wordcount") ssc.start() ssc.awaitTermination() ! Initialization CassandraRDD Stream Initialization Transformations and Action
  37. 37. Spark Cassandra Example val sc = new SparkContext(..) val ssc = new StreamingContext(sc, Seconds(5)) ! val stream = TwitterUtils.createStream(ssc, auth, filters, StorageLevel.MEMORY_ONLY_SER_2) val transform = (cruft: String) => Pattern.findAllIn(cruft).flatMap(_.stripPrefix("#")) /** Note that Cassandra is doing the sorting for you here. */ stream.flatMap(_.getText.toLowerCase.split("""s+""")) .map(transform) .countByValueAndWindow(Seconds(5), Seconds(5)) .transform((rdd, time) => { case (term, count) => (term, count, now(time))}) .saveToCassandra(keyspace, suspicious, SomeColumns(“suspicious", "count", “timestamp"))
  38. 38. Reading: From C* To Spark row representation keyspace table val table = sc .cassandraTable[CassandraRow]("keyspace", "tweets") .select("user_name", "message") .where("user_name = ?", "ewa") server side column and row selection
  39. 39. CassandraRDD class CassandraRDD[R](..., keyspace: String, table: String, ...) extends RDD[R](...) { // Splits the table into multiple Spark partitions, // each processed by single Spark Task override def getPartitions: Array[Partition] // Returns names of hosts storing given partition (for data locality!) override def getPreferredLocations(split: Partition): Seq[String] // Returns iterator over Cassandra rows in the given partition override def compute(split: Partition, context: TaskContext): Iterator[R] }
  40. 40. CassandraStreamingRDD /** RDD representing a Cassandra table for Spark Streaming. * @see [[com.datastax.spark.connector.rdd.CassandraRDD]] */ class CassandraStreamingRDD[R] private[connector] ( sctx: StreamingContext, connector: CassandraConnector, keyspace: String, table: String, columns: ColumnSelector = AllColumns, where: CqlWhereClause = CqlWhereClause.empty, readConf: ReadConf = ReadConf())( implicit ct : ClassTag[R], @transient rrf: RowReaderFactory[R]) extends CassandraRDD[R](sctx.sparkContext, connector, keyspace, table, columns, where, readConf)
  41. 41. Paging Reads with .cassandraTable • Page size is configurable! • Controls how many CQL rows to fetch at a time, when fetching a single partition! • Connector returns an iterator for rows to Spark! • Spark iterates over this, lazily ! • Handled by the java driver as well as spark
  42. 42. ResultSet Paging and Pre-Fetching Node 1 Client Cassandra Node 1 request a page data process data request a page data request a page Node 2 Client Cassandra Node 2 request a page data process data request a page data request a page
  43. 43. Co-locate Spark and C* for Best Performance 44 C* C* C* Spark Worker C* Spark Worker Spark Master Spark Worker Running Spark Workers on the same nodes as your C* Cluster will save network hops when reading and writing
  44. 44. The Key To Speed - Data Locality • LocalNodeFirstLoadBalancingPolicy! • Decides what node will become the coordinator for the given mutation/read! • Selects local node first and then nodes in the local DC in random order! • Once that node receives the request it will be distributed ! • Proximal Node Sort Defined by the C* snitch! Analytic • apache/cassandra/locator/ Analytic L190 Search
  45. 45. Column Type Conversions trait TypeConverter[T] extends Serializable { def targetTypeTag: TypeTag[T] def convertPF: PartialFunction[Any, T] } ! class CassandraRow(...) { def get[T](index: Int)(implicit c: TypeConverter[T]): T = … def get[T](name: String)(implicit c: TypeConverter[T]): T = … … } ! implicit object IntConverter extends TypeConverter[Int] { def targetTypeTag = implicitly[TypeTag[Int]] convertPF = { case def x: Number => x.intValue case x: String => x.toInt } }
  46. 46. Rows to Case Class Instances val row: CassandraRow = table.first case class Tweet(userName: String, tweetId: UUID, publicationDate: Date, message: String) extends Serializable ! val tweets = sc.cassandraTable[Tweet]("db", "tweets") Scala Cassandra message message column1 column_1 userName user_name Scala Cassandra Message Message column1 column1 userName userName Recommended convention:
  47. 47. Convert Rows To Tuples ! val tweets = sc .cassandraTable[(String, String)]("db", "tweets") .select("userName", "message") When returning tuples, always use select to ! specify the column order!
  48. 48. Handling Unsupported Types case class Tweet( userName: String, tweetId: my.private.implementation.of.UUID, publicationDate: Date, message: String) ! val tweets: CassandraRDD[Tweet] = sc.cassandraTable[Tweet]("db", "tweets") Current behavior: runtime error Future: compile-time error (macros!) Macros could even parse Cassandra schema!
  49. 49. Connector Code and Docs! ! ! ! ! ! ! Add It To Your Project:! ! val connector = "com.datastax.spark" %% "spark-­‐cassandra-­‐connector" % "1.1.0-­‐beta2"
  50. 50. What’s New In Spark? •Petabyte sort record! • Myth busting: ! • Spark is in-memory. It doesn’t work with big data! • It’s too expensive to buy a cluster with enough memory to fit our data! •Application Integration! • Tableau, Trifacta, Talend, ElasticSearch, Cassandra! •Ongoing development for Spark 1.2! • Python streaming, new MLib API, Yarn scaling…
  51. 51. Recent / Pending Spark Updates •Usability, Stability & Performance Improvements ! • Improved Lightning Fast Shuffle!! • Monitoring the performance long running or complex jobs! •Public types API to allow users to create SchemaRDD’s! •Support for registering Python, Scala, and Java lambda functions as UDF ! •Dynamic bytecode generation significantly speeding up execution for queries that perform complex expression evaluation
  52. 52. Recent Spark Streaming Updates •Apache Flume: a new pull-based mode (simplifying deployment and providing high availability)! •The first of a set of streaming machine learning algorithms is introduced with streaming linear regression.! •Rate limiting has been added for streaming inputs
  53. 53. Recent MLib Updates • New library of statistical packages which provides exploratory analytic functions *stratified sampling, correlations, chi-squared tests, creating random datasets…)! • Utilities for feature extraction (Word2Vec and TF-IDF) and feature transformation (normalization and standard scaling). ! • Support for nonnegative matrix factorization and SVD via Lanczos. ! • Decision tree algorithm has been added in Python and Java. ! • Tree aggregation primitive! • Performance improves across the board, with improvements of around
  54. 54. Recent/Pending Connector Updates •Spark SQL (came in 1.1)! •Python API! •Official Scala Driver for Cassandra! •Removing last traces of Thrift!!! •Performance Improvements! • Token-aware data repartitioning! • Token-aware saving! • Wide-row support - no costly groupBy call
  55. 55. Resources •Spark Cassandra Connector ! ! •Apache Cassandra! •Apache Spark ! •Apache Kafka ! •Akka Analytic Analytic
  56. 56.
  57. 57. Thanks for listening! SEPTEMBER 10 -­‐ 11, 2014 | SAN FRANCISCO, CALIF. | THE WESTIN ST. FRANCIS HOTEL Cassandra Summit