
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson

  1. 1. Streaming Analytics with Spark, Kafka, Cassandra, and Akka Helena Edelson VP of Product Engineering @Tuplejump
  2. 2. • Committer / Contributor: Akka, FiloDB, Spark Cassandra Connector, Spring Integration • VP of Product Engineering @Tuplejump • Previously: Sr Cloud Engineer / Architect at VMware, CrowdStrike, DataStax and SpringSource Who @helenaedelson github.com/helena
  3. 3. Tuplejump Open Source github.com/tuplejump • FiloDB - distributed, versioned, columnar analytical db for modern streaming workloads • Calliope - the first Spark-Cassandra integration • Stargate - Lucene indexer for Cassandra • SnackFS - HDFS-compatible file system for Cassandra 3
  4. 4. What Will We Talk About • The Problem Domain • Example Use Case • Rethinking Architecture – We don't have to look far to look back – Streaming – Revisiting the goal and the stack – Simplification
  5. 5. THE PROBLEM DOMAIN Delivering Meaning From A Flood Of Data 5
  6. 6. The Problem Domain Need to build scalable, fault tolerant, distributed data processing systems that can handle massive amounts of data from disparate sources, with different data structures. 6
  7. 7. Translation How to build adaptable, elegant systems for complex analytics and learning tasks to run as large-scale clustered dataflows 7
  8. 8. How Much Data Yottabyte = quadrillion gigabytes or septillion bytes 8 We all have a lot of data • Terabytes • Petabytes... https://en.wikipedia.org/wiki/Yottabyte
  9. 9. Delivering Meaning • Deliver meaning in sec/sub-sec latency • Disparate data sources & schemas • Billions of events per second • High-latency batch processing • Low-latency stream processing • Aggregation of historical from the stream
10. 10. While We Monitor, Predict & Proactively Handle • Massive event spikes • Bursty traffic • Fast producers / slow consumers • Network partitioning & Out of sync systems • DC down • Wait, we've DDOS'd ourselves from fast streams? • Autoscale issues – When we scale down VMs how do we not lose data?
  11. 11. And stay within our AWS / Rackspace budget
  12. 12. EXAMPLE CASE: CYBER SECURITY Hunting The Hunter 12
13. 13. 13 • Track activities of international threat actor groups, nation-state, criminal or hacktivist • Intrusion attempts • Actual breaches • Profile adversary activity • Analysis to understand their motives, anticipate actions and prevent damage Adversary Profiling & Hunting
14. 14. 14 • Machine events • Endpoint intrusion detection • Anomalies/indicators of attack or compromise • Machine learning • Training models based on patterns from historical data • Predict potential threats • Profiling for adversary identification • Stream Processing
  15. 15. Data Requirements & Description • Streaming event data • Log messages • User activity records • System ops & metrics data • Disparate data sources • Wildly differing data structures 15
  16. 16. Massive Amounts Of Data 16 • One machine can generate 2+ TB per day • Tracking millions of devices • 1 million writes per second - bursty • High % writes, lower % reads • TTL
  17. 17. RETHINKING ARCHITECTURE 17
  18. 18. WE DON'T HAVE TO LOOK FAR TO LOOK BACK 18 Rethinking Architecture
19. 19. 19 Most batch analytics flows from several years ago looked like...
  20. 20. STREAMING & DATA SCIENCE 20 Rethinking Architecture
  21. 21. Streaming I need fast access to historical data on the fly for predictive modeling with real time data from the stream. 21
  22. 22. Not A Stream, A Flood • Data emitters • Netflix: 1 - 2 million events per second at peak • 750 billion events per day • LinkedIn: > 500 billion events per day • Data ingesters • Netflix: 50 - 100 billion events per day • LinkedIn: 2.5 trillion events per day • 1 Petabyte of streaming data 22
  23. 23. Which Translates To • Do it fast • Do it cheap • Do it at scale 23
  24. 24. Challenges • Code changes at runtime • Distributed Data Consistency • Ordering guarantees • Complex compute algorithms 24
  25. 25. Oh, and don't lose data 25
  26. 26. Strategies • Partition For Scale & Data Locality • Replicate For Resiliency • Share Nothing • Fault Tolerance • Asynchrony • Async Message Passing • Memory Management 26 • Data lineage and reprocessing in runtime • Parallelism • Elastically Scale • Isolation • Location Transparency
  27. 27. AND THEN WE GREEKED OUT 27 Rethinking Architecture
  28. 28. Lambda Architecture A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods. 28
  29. 29. Lambda Architecture A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods. • An approach • Coined by Nathan Marz • This was a huge stride forward 29
  30. 30. 30 https://www.mapr.com/developercentral/lambda-architecture
  31. 31. Implementing Is Hard 32 • Real-time pipeline backed by KV store for updates • Many moving parts - KV store, real time, batch • Running similar code in two places • Still ingesting data to Parquet/HDFS • Reconcile queries against two different places
  32. 32. Performance Tuning & Monitoring: on so many systems 33 Also hard
  33. 33. Lambda Architecture An immutable sequence of records is captured and fed into a batch system and a stream processing system in parallel. 34
  34. 34. WAIT, DUAL SYSTEMS? 35 Challenge Assumptions
  35. 35. Which Translates To • Performing analytical computations & queries in dual systems • Implementing transformation logic twice • Duplicate Code • Spaghetti Architecture for Data Flows • One Busy Network 36
36. 36. Why Dual Systems? • Why is a separate batch system needed? • Why support code, machines and running services of two analytics systems? 37 Counterproductive on some level?
37. 37. YES 38 • A unified system for streaming and batch • Real-time processing and reprocessing • Code changes • Fault tolerance http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html - Jay Kreps
  38. 38. ANOTHER ASSUMPTION: ETL 39 Challenge Assumptions
39. 39. Extract, Transform, Load (ETL) 40 "Designing and maintaining the ETL process is often considered one of the most difficult and resource-intensive portions of a data warehouse project." http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm
  40. 40. Extract, Transform, Load (ETL) 41 ETL involves • Extraction of data from one system into another • Transforming it • Loading it into another system
41. 41. Extract, Transform, Load (ETL) "Designing and maintaining the ETL process is often considered one of the most difficult and resource-intensive portions of a data warehouse project." http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm 42 Also unnecessarily redundant and often typeless
  42. 42. ETL 43 • Each ETL step can introduce errors and risk • Can duplicate data after failover • Tools can cost millions of dollars • Decreases throughput • Increased complexity
  43. 43. ETL • Writing intermediary files • Parsing and re-parsing plain text 44
  44. 44. And let's duplicate the pattern over all our DataCenters 45
  45. 45. 46 These are not the solutions you're looking for
  46. 46. REVISITING THE GOAL & THE STACK 47
  47. 47. Removing The 'E' in ETL Thanks to technologies like Avro and Protobuf we don’t need the “E” in ETL. Instead of text dumps that you need to parse over multiple systems: Scala & Avro (e.g.) • Can work with binary data that remains strongly typed • A return to strong typing in the big data ecosystem 48
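To make the strong-typing point concrete, here is a minimal sketch using Avro's GenericRecord API in Scala (the RawEvent schema and its fields are illustrative, not from the talk): the record is validated against a schema and shipped as compact binary, so downstream consumers never re-parse text dumps.

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

// A hypothetical event schema; in practice this lives in a schema registry or shared module.
val schema: Schema = new Schema.Parser().parse(
  """{"type":"record","name":"RawEvent","fields":[
    |  {"name":"wsid","type":"string"},
    |  {"name":"temperature","type":"double"}
    |]}""".stripMargin)

val record: GenericRecord = new GenericData.Record(schema)
record.put("wsid", "725030:14732")
record.put("temperature", 21.5)

// Encode to Avro binary: strongly typed on both ends, no intermediary text files to extract and parse.
val out = new ByteArrayOutputStream()
val writer = new GenericDatumWriter[GenericRecord](schema)
val encoder = EncoderFactory.get().binaryEncoder(out, null)
writer.write(record, encoder)
encoder.flush()
val bytes: Array[Byte] = out.toByteArray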
  48. 48. Removing The 'L' in ETL If data collection is backed by a distributed messaging system (e.g. Kafka) you can do real-time fanout of the ingested data to all consumers. No need to batch "load". • From there each consumer can do their own transformations 49
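A sketch of that fan-out, using the newer KafkaConsumer client rather than the 0.8-era API shown elsewhere in this deck (broker address, topic, and group names are made up): every consumer group independently receives the full ingested stream and applies its own transformation, with no batch "load" step in between.

import java.util.Properties
import java.time.Duration
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

def consumerFor(groupId: String): KafkaConsumer[String, String] = {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("group.id", groupId) // each group gets its own full copy of the topic
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  new KafkaConsumer[String, String](props)
}

// Real-time fan-out: an analytics job and an archiver subscribe to the same topic independently.
val analytics = consumerFor("analytics-group")
analytics.subscribe(List("ingest.raw").asJava)
analytics.poll(Duration.ofMillis(500)).asScala.foreach { record =>
  println(record.value()) // a real consumer would apply its own transformation here
}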
  49. 49. #NoMoreGreekLetterArchitectures 50
  50. 50. NoETL 51
51. 51. My Nerdy Chart 52
Strategy → Technologies
• Scalable Infrastructure / Elastic: Spark, Cassandra, Kafka
• Partition For Scale, Network Topology Aware: Cassandra, Spark, Kafka, Akka Cluster
• Replicate For Resiliency: Spark, Cassandra, Akka Cluster (all hash the node ring)
• Share Nothing, Masterless: Cassandra, Akka Cluster (both Dynamo style)
• Fault Tolerance / No Single Point of Failure: Spark, Cassandra, Kafka
• Replay From Any Point Of Failure: Spark, Cassandra, Kafka, Akka + Akka Persistence
• Failure Detection: Cassandra, Spark, Akka, Kafka
• Consensus & Gossip: Cassandra & Akka Cluster
• Parallelism: Spark, Cassandra, Kafka, Akka
• Asynchronous Data Passing: Kafka, Akka, Spark
• Fast, Low Latency, Data Locality: Cassandra, Spark, Kafka
• Location Transparency: Akka, Spark, Cassandra, Kafka
  52. 52. SMACK • Scala/Spark • Mesos • Akka • Cassandra • Kafka 53
  53. 53. Spark Streaming 54
  54. 54. Spark Streaming • One runtime for streaming and batch processing • Join streaming and static data sets • No code duplication • Easy, flexible data ingestion from disparate sources to disparate sinks • Easy to reconcile queries against multiple sources • Easy integration of KV durable storage 55
  55. 55. How do I merge historical data with data in the stream? 56
56. 56. Join Streams With Static Data
val ssc = new StreamingContext(conf, Milliseconds(500))
ssc.checkpoint("checkpoint")

val staticData: RDD[(Int, String)] =
  ssc.sparkContext.textFile("whyAreWeParsingFiles.txt").flatMap(func)

val stream: DStream[(Int, String)] =
  KafkaUtils.createStream(ssc, zkQuorum, group, Map(topic -> n))

stream.transform(events => events.join(staticData))
  .saveToCassandra(keyspace, table)

ssc.start()
57
57. 57. Spark MLlib 58
[Pipeline diagram: Your Data → Extract Data To Analyze → Feature Extraction → Model Training (Training Data) → Model Testing (Test Data) → Train your model to predict]
58. 58. Spark Streaming & ML 59
val ssc = new StreamingContext(conf, Milliseconds(500))

val model = KMeans.train(dataset, ...) // learn offline

val stream = KafkaUtils
  .createStream(ssc, zkQuorum, group, ...)
  .map(event => model.predict(event.feature))
  59. 59. Apache Mesos Open-source cluster manager developed at UC Berkeley. Abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively. 60
  60. 60. Akka High performance concurrency framework for Scala and Java • Fault Tolerance • Asynchronous messaging and data processing • Parallelization • Location Transparency • Local / Remote Routing • Akka: Cluster / Persistence / Streams 61
  61. 61. Akka Actors A distribution and concurrency abstraction • Compute Isolation • Behavioral Context Switching • No Exposed Internal State • Event-based messaging • Easy parallelism • Configurable fault tolerance 62
  62. 62. 63 Akka Actor Hierarchy http://www.slideshare.net/jboner/building-reactive-applications-with-akka-in-scala
63. 63. import akka.actor._

class NodeGuardianActor(args...) extends Actor with SupervisorStrategy {

  val temperature = context.actorOf(
    Props(new TemperatureActor(args)), "temperature")

  val precipitation = context.actorOf(
    Props(new PrecipitationActor(args)), "precipitation")

  override def preStart(): Unit = { /* lifecycle hook: init */ }

  def receive: Actor.Receive = {
    case Initialized => context become initialized
  }

  def initialized: Actor.Receive = {
    case e: SomeEvent  => someFunc(e)
    case e: OtherEvent => otherFunc(e)
  }
}
64
  64. 64. Apache Cassandra • Extremely Fast • Extremely Scalable • Multi-Region / Multi-Datacenter • Always On • No single point of failure • Survive regional outages • Easy to operate • Automatic & configurable replication 65
  65. 65. Apache Cassandra • Very flexible data modeling (collections, user defined types) and changeable over time • Perfect for ingestion of real time / machine data • Huge community 66
  66. 66. Spark Cassandra Connector • NOSQL JOINS! • Write & Read data between Spark and Cassandra • Compatible with Spark 1.4 • Handles Data Locality for Speed • Implicit type conversions • Server-Side Filtering - SELECT, WHERE, etc. • Natural Timeseries Integration 67 http://github.com/datastax/spark-cassandra-connector
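A minimal sketch of the connector calls listed above (keyspace, table, and column names are illustrative, not the talk's schema): select and where predicates are filtered server-side in Cassandra, and the join is resolved against Cassandra partition keys.

import com.datastax.spark.connector._ // adds cassandraTable, saveToCassandra, joinWithCassandraTable

// Server-side projection and filtering: only the requested columns and rows leave Cassandra.
val precip = sc.cassandraTable("weather_ks", "daily_precip")
  .select("wsid", "year", "precipitation")
  .where("wsid = ? AND year = ?", "725030:14732", 2015)

// "NoSQL join": look up the Cassandra rows matching each element of an RDD, joined on the partition key.
case class Station(wsid: String)
val joined = sc.parallelize(Seq(Station("725030:14732")))
  .joinWithCassandraTable("weather_ks", "daily_precip")

// Write results back; the column mapping is explicit.
precip
  .map(row => (row.getString("wsid"), row.getInt("year"), row.getDouble("precipitation")))
  .saveToCassandra("weather_ks", "precip_rollup", SomeColumns("wsid", "year", "precipitation"))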
67. 67. KillrWeather 68 http://github.com/killrweather/killrweather A reference application showing how to easily integrate streaming and batch data processing with Apache Spark Streaming, Apache Cassandra, Apache Kafka and Akka for fast, streaming computations on time series data in asynchronous event-driven environments. http://github.com/databricks/reference-apps/tree/master/timeseries/scala/timeseries-weather/src/main/scala/com/databricks/apps/weather
68. 68. 69 Apache Kafka • High Throughput Distributed Messaging • Decouples Data Pipelines • Handles Massive Data Load • Supports Massive Numbers of Consumers • Distribution & partitioning across cluster nodes • Automatic recovery from broker failures
69. 69. Spark Streaming & Kafka
val context = new StreamingContext(conf, Seconds(1))

val wordCount = KafkaUtils.createStream(context, ...)
  .map { case (_, line) => line } // take the message value from the (key, value) pair
  .flatMap(_.split(" "))
  .map(x => (x, 1))
  .reduceByKey(_ + _)

wordCount.saveToCassandra(ks, table)
context.start() // start receiving and computing
70
  70. 70. 71 class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext) extends AggregationActor(settings: Settings) {
 import settings._ 
 val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
 ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
 .map(_._2.split(","))
 .map(RawWeatherData(_))
 
 /** Save the raw data to Cassandra. The Spark Cassandra Connector gets the
   * partition key from the table schema and feeds it to Spark: data locality. */
 kafkaStream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

 /** RawWeatherData: wsid, year, month, day, oneHourPrecip.
   * The daily precip table uses a Cassandra counter column, so no expensive
   * `reduceByKey` is needed. Simply let C* do it: not expensive and fast. */
 kafkaStream.map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
   .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)

 /** Now the [[StreamingContext]] can be started. */
 context.parent ! OutputStreamInitialized

 def receive: Actor.Receive = {…}
}
  71. 71. 72 /** For a given weather station, calculates annual cumulative precip - or year to date. */
 class PrecipitationActor(ssc: StreamingContext, settings: WeatherSettings) extends AggregationActor {
 
 def receive : Actor.Receive = {
 case GetPrecipitation(wsid, year) => cumulative(wsid, year, sender)
 case GetTopKPrecipitation(wsid, year, k) => topK(wsid, year, k, sender)
 }
 
 /** Computes the annual aggregation. Precipitation values are 1-hour deltas from the previous. */
 def cumulative(wsid: String, year: Int, requester: ActorRef): Unit =
 ssc.cassandraTable[Double](keyspace, dailytable)
 .select("precipitation")
 .where("wsid = ? AND year = ?", wsid, year)
 .collectAsync()
 .map(AnnualPrecipitation(_, wsid, year)) pipeTo requester
 
 /** Returns the top-k precipitation values for the given station in the `year`. */
 def topK(wsid: String, year: Int, k: Int, requester: ActorRef): Unit = {
 val toTopK = (aggregate: Seq[Double]) => TopKPrecipitation(wsid, year,
 ssc.sparkContext.parallelize(aggregate).top(k).toSeq)
 
 ssc.cassandraTable[Double](keyspace, dailytable)
 .select("precipitation")
 .where("wsid = ? AND year = ?", wsid, year)
 .collectAsync().map(toTopK) pipeTo requester
 }
 }
  72. 72. A New Approach • One Runtime: streaming, scheduled • Simplified architecture • Allows us to • Write different types of applications • Write more type safe code • Write more reusable code 73
  73. 73. Need daily analytics aggregate reports? Do it in the stream, save results in Cassandra for easy reporting as needed - with data locality not offered by S3.
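A rough sketch of that pattern (keyspace and table names are illustrative, reusing the RawWeatherData shape from the earlier listing): roll the daily aggregate up inside the stream and write it straight to a Cassandra reporting table.

import com.datastax.spark.connector.streaming._ // saveToCassandra on DStreams
import org.apache.spark.streaming.Minutes

// `rawStream: DStream[RawWeatherData]` is assumed to be the parsed Kafka stream from the earlier listing.
val dailyPrecip = rawStream
  .map(d => ((d.wsid, d.year, d.month, d.day), d.oneHourPrecip))
  .reduceByKeyAndWindow(_ + _, Minutes(60), Minutes(60)) // sum hourly deltas within the window
  .map { case ((wsid, year, month, day), precip) => (wsid, year, month, day, precip) }

// Reports query this table directly, with Cassandra data locality - no S3 scans, no second batch job.
dailyPrecip.saveToCassandra("weather_ks", "daily_precip_report")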
  74. 74. FiloDB Distributed, columnar database designed to run very fast analytical queries • Ingest streaming data from many streaming sources • Row-level, column-level operations and built in versioning offer greater flexibility than file-based technologies • Currently based on Apache Cassandra & Spark • github.com/tuplejump/FiloDB 75
  75. 75. FiloDB • Breakthrough performance levels for analytical queries • Performance comparable to Parquet • One to two orders of magnitude faster than Spark on Cassandra 2.x • Versioned - critical for reprocessing logic/code changes • Can simplify your infrastructure dramatically • Queries run in parallel in Spark for scale-out ad-hoc analysis • Space-saving techniques 76
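As a rough sketch of how FiloDB plugs into that ad-hoc analysis story, it is exposed to Spark as a SQL data source, so scale-out queries are plain Spark SQL / DataFrames. The format name, option key, and dataset name below are assumptions, not a definitive API reference.

// Assumed FiloDB Spark data source usage; verify names against github.com/tuplejump/FiloDB.
val df = sqlContext.read
  .format("filodb.spark")            // assumed data-source name
  .option("dataset", "raw_weather")  // assumed option key; dataset name is illustrative
  .load()

df.filter(df("year") === 2015).groupBy("wsid").count().show()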
  76. 76. WRAPPING UP 77
  77. 77. Architectyr? 78 "This is a giant mess" - Going Real-time - Data Collection and Stream Processing with Apache Kafka, Jay Kreps
  78. 78. 79 Simplified
  79. 79. 80
80. 80. 81 www.tuplejump.com | info@tuplejump.com | @tuplejump
  81. 81. 82 @helenaedelson github.com/helena slideshare.net/helenaedelson THANK YOU!
  82. 82. I'm speaking at QCon SF on the broader topic of Streaming at Scale http://qconsf.com/sf2015/track/streaming-data-scale 83
