Real Time Analytics with DSE

Published in: Technology

  1. Real Time Analytics with DataStax Enterprise • Ryan Knight (@Knight_Cloud), Solution Engineer, DataStax
  2. Introduction to Spark • © 2014 DataStax, All Rights Reserved
  3. Hadoop Limitations • Master / Slave Architecture • Every Processing Step requires Disk I/O • Difficult API and Programming Model • Designed for batch-mode jobs • No event streaming / real-time processing • Complex Ecosystem
  4. Hadoop?
  5. Apps in the early 2000s were written for → Apps today are written for • Single machines → Clusters of machines • Single core processors → Multicore processors • Expensive RAM → Cheap RAM • Expensive disk → Cheap disk • Slow networks → Fast networks • Few concurrent users → Lots of concurrent users • Small data sets → Large data sets • Latency in seconds → Latency in milliseconds
  6. What is Spark? • Fast and general compute engine for large-scale data processing • Fault-Tolerant Distributed Datasets • Distributed Transformations on Datasets • Integrated Batch, Iterative and Streaming Analysis • In-Memory Storage with Spill-over to Disk
  7. Advantages of Spark • Improves efficiency through: • In-memory data sharing • General computation graphs - Lazily Evaluates Data • Up to 10x faster on disk, 100x faster in memory than Hadoop MR • Improves usability through: • Rich APIs in Java, Scala, Python • 2 to 5x less code • Interactive shell
  8. [Figure]
  9. Scala for Data Analytics • Functional Paradigm is ideal for Data Analytics • Strongly Typed - Enforce Schema at Every Layer • Immutable by Default - Event Logging • Declarative instead of Imperative - Focus on Transformation not Implementation
  10. Spark Streaming
  11. Spark Versus Spark Streaming
  12. Spark Streaming General Architecture
  13. DStream Micro Batches
  14. Windowing
  15. Spark Cassandra Connector
  16. Spark is about Data Analytics • How do we get data into Spark? • How can we work with large datasets? • What do we do with the results of the analytics?
  17. Spark Cassandra Connector
  18. Spark Cassandra Connector • Uses the DataStax Java Driver to Read from and Write to C* • Each Executor Maintains a connection to the C* Cluster • Full Token Range is divided into sets of tokens (Tokens 1-1000, Tokens 1001-2000, Tokens …) • RDDs are read into different splits based on sets of tokens
  19. Connector Token Range Mapping • Each Spark Executor Maintains a connection to the C* Cluster through the DataStax Java Driver • Full Token Range is divided into sets of tokens (Tokens 1-1000, Tokens 1001-2000, Tokens …) • RDDs are read into different splits based on sets of tokens
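The token-range mapping on this slide can be pictured with a small sketch. This is not the connector's implementation: `TokenSplits`, `TokenRange` and `split` are hypothetical names, and the sketch assumes a small positive token range like the slide's 1-1000 / 1001-2000 example rather than Cassandra's full 64-bit token space.

```scala
object TokenSplits {
  /** A contiguous set of tokens with inclusive bounds (hypothetical type). */
  final case class TokenRange(start: Long, end: Long)

  /** Divide [first, last] into at most `numSplits` contiguous, non-overlapping ranges. */
  def split(first: Long, last: Long, numSplits: Int): Vector[TokenRange] = {
    require(numSplits > 0 && first <= last, "need a positive split count and a non-empty range")
    val total = last - first + 1                    // assumes the range is small enough not to overflow
    val n     = math.min(numSplits.toLong, total).toInt
    val base  = total / n                           // minimum tokens per split
    val extra = total % n                           // the first `extra` splits get one more token
    (0 until n).toVector.map { i =>
      val start = first + i * base + math.min(i.toLong, extra)
      val size  = base + (if (i < extra) 1L else 0L)
      TokenRange(start, start + size - 1)
    }
  }
}
```

Each resulting range would correspond to one split that an executor reads independently, for example `TokenSplits.split(1L, 3000L, 3)` for three splits of 1000 tokens each.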
  20. Spark Cassandra Connector • Data locality-aware (speed) • Read from and Write to Cassandra • Cassandra Tables Exposed as RDDs and DataFrames • Server-Side filters (where clauses) • Cross-table operations (JOIN, UNION, etc.) • Mapping of Java Types to Cassandra Types
  21. Spark Cassandra Connector • Open Source Project • Requires maintaining separate Cassandra and Spark Clusters • Spark Master is not Highly Available without Zookeeper • Submitting Spark Applications requires setting hard-coded Spark Master and Cassandra Locations
  22. DataStax Enterprise Data Platform
  23. DataStax Enterprise Platform: Workload Segregation w/out ETL • Cassandra: OLTP Database • Analytics: Streaming and Analytics • Search: All Data Searchable • Graph: Graph Data Structure - Coming this year
  24. DataStax Enterprise Platform: Workload Segregation w/out ETL
  25. DSE Analytics with Spark: Internal / Administrative Benefits • DSE Analytic Nodes configured to run Spark • No need to run separate Spark Cluster • Simplified Deployment and Management • No need to specify Spark Master and Cassandra Host • High Availability of Spark Master
  26. DataStax Enterprise Platform: Integrated Spark Analytics
  27. Spark Master High Availability • High Availability Spark Master with automatic leader election • Detects when Spark Master is down with gossip • Uses Paxos to elect Spark Master • Stores Spark Worker metadata in Cassandra • No need to run Zookeeper
  28. DSE Analytics with Spark: Integration Benefits • Integration of Analytics and Search • Spark Job Server • SparkSQL and HiveQL access of Cassandra Data • Streaming Resiliency w/ Kafka Direct API via Cassandra File System
  29. DSE 4.8 Analytics + Search • Allows Analytics Jobs to use Solr Queries • Allows searching for data across partitions
       // sc.cassandraTable comes from the Spark Cassandra Connector implicits
       import com.datastax.spark.connector._
       val table = sc.cassandraTable("music", "albums")
       val result = table.select("id", "artist_name")
         .where("solr_query='artist_name:Miles*'")
         .collect()
  30. Network Traffic Analysis Architecture • Data Center 1 - US East and Data Center 2 - US West, linked by Data Center Replication • In each data center: Spark Streaming from Kafka, DSE Analytics Streaming Analysis, DSE Analytics Batch Analysis • One Active Kafka and one Passive Kafka
  31. Common Use Cases • Personalization • Banking Fraud Detection • Website Click Stream Analysis • Login Monitoring
  32. Spark Streaming Demo
  33. Spark Notebook • Notebooks • Spark Notebook Server • Cassandra Cluster with Spark Connector
  34. Apache Spark Notebook • Reactive / Dynamic Graphs based on Scala, SQL and DataFrames • Spark Streaming • Example notebooks covering visualization, machine learning, streaming, graph analysis, genomics analysis • SVG / Sliders - interactive graphs • Tune and Configure Each Notebook Separately • https://github.com/andypetrella/spark-notebook
  35. Demo of Streaming in the Real World - Spark At Scale Project • Based on Real World Use Cases • Simulate a real world streaming use case • Test throughput of Spark Streaming • Best Practices for scaling • https://github.com/retroryan/SparkAtScale
  36. Spark At Scale Demo Application • DataStax Enterprise Platform • Web Service • Legacy Systems
  37. Best Practices for Spark Streaming
  38. Spark Streaming with Kafka Direct Approach • Use Kafka Direct Approach (No Receivers) • Queries Kafka Directly • Automatically Parallelizes based on Kafka Partitions • Exactly Once Processing - Only Move Offset after Processing • Resiliency without copying data
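The "only move offset after processing" rule can be illustrated without Spark or Kafka. The sketch below is a plain-Scala simulation under stated assumptions: the partition log is an in-memory `Vector`, and `OffsetTracking`, `OffsetRange` and `runBatch` are hypothetical names, not part of either library's API.

```scala
import scala.util.{Failure, Success, Try}

object OffsetTracking {
  /** A contiguous slice of one partition's log: [fromOffset, untilOffset). */
  final case class OffsetRange(fromOffset: Long, untilOffset: Long)

  /**
   * Process one micro-batch and return the offset the next batch should start from.
   * The offset advances only when `process` succeeds, so a failed batch is read
   * again from the same starting offset.
   */
  def runBatch(log: Vector[String], committed: Long, batchSize: Int)
              (process: Seq[String] => Unit): Long = {
    val range   = OffsetRange(committed, math.min(committed + batchSize, log.size.toLong))
    val records = log.slice(range.fromOffset.toInt, range.untilOffset.toInt)
    Try(process(records)) match {
      case Success(_) => range.untilOffset // processing finished: move the offset forward
      case Failure(_) => committed          // processing failed: keep the old offset and retry
    }
  }
}
```

Because a failed batch is replayed from the log rather than from a separate copy of the data, the output step generally needs to be idempotent or transactional for the end-to-end result to be exactly-once.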
  39. Spark Streaming Deployment • Don’t build fat jars! • spark-submit --packages: specify dependencies Maven-style • Test submit options to match load • --executor-memory 4G • --total-executor-cores 15
  40. How do we Scale for Load and Traffic?
  41. [Figure]
  42. Spark Streaming Monitoring • Processing Time > Batch Duration => Total Delay Grows • Out Of Memory Errors
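Why total delay grows once processing time exceeds the batch duration can be seen in a small queueing simulation. This is a simplified model, not Spark's scheduler: it assumes a constant processing time, one batch processed at a time, and the hypothetical names `StreamingDelay` and `totalDelays`.

```scala
object StreamingDelay {
  /** Total delay (scheduling delay + processing time) in ms for each of `batches` batches. */
  def totalDelays(batchDurationMs: Long, processingTimeMs: Long, batches: Int): Vector[Long] = {
    // Fold state: (time the previous batch finished, delays collected so far).
    val (_, delays) = (0 until batches).foldLeft((0L, Vector.empty[Long])) {
      case ((prevFinish, acc), i) =>
        val ready  = (i + 1) * batchDurationMs   // batch i is ready when its interval closes
        val start  = math.max(ready, prevFinish) // it waits if the previous batch is still running
        val finish = start + processingTimeMs
        (finish, acc :+ (finish - ready))
    }
    delays
  }
}
```

With a 5000 ms batch duration, a 4000 ms processing time gives a constant delay of 4000 ms per batch, while a 6000 ms processing time adds another 1000 ms of delay with every batch. The queued batches also hold memory, which is how the growing delay leads to the out-of-memory errors named on the slide.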
  43. Data Modeling using Event Sourcing • Append-Only Logging • Database of Facts • Snapshots or Roll-Ups • Why Delete Data any more? • Replay Events
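A minimal sketch of the event-sourcing idea: state is never updated in place, it is derived by replaying an append-only log, optionally starting from a snapshot. The account domain and the names `EventSourcing`, `Deposited`, `Withdrawn`, `applyEvent` and `replay` are illustrative assumptions, not taken from the demo project.

```scala
object EventSourcing {
  // Events are immutable facts; new facts are appended, old ones are never edited or deleted.
  sealed trait Event
  final case class Deposited(account: String, amount: Long) extends Event
  final case class Withdrawn(account: String, amount: Long) extends Event

  type Balances = Map[String, Long]

  /** Apply one event to the current state and return the new state. */
  def applyEvent(state: Balances, event: Event): Balances = event match {
    case Deposited(acct, amt) => state.updated(acct, state.getOrElse(acct, 0L) + amt)
    case Withdrawn(acct, amt) => state.updated(acct, state.getOrElse(acct, 0L) - amt)
  }

  /** Rebuild state by replaying events on top of a snapshot (empty by default). */
  def replay(events: Seq[Event], snapshot: Balances = Map.empty): Balances =
    events.foldLeft(snapshot)(applyEvent)
}
```

A snapshot or roll-up is just the result of `replay` over a prefix of the log; replaying the remaining events on top of it gives the same state as replaying everything from the start.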
  44. Spark SQL and DataFrames
  45. Why Spark SQL? • Creating and Running Spark Programs Faster • Write less code • Read less data • Let the optimizer do the hard work • Spark SQL Catalyst optimizer
  46. DataFrame • Distributed collection of data • Similar to a Table in an RDBMS • Common API for reading/writing data • API for selecting, filtering, aggregating and plotting structured data
  47. DataFrame Part 2 • Sources such as Cassandra, structured data files, tables in Hive, external databases, or existing RDDs • Optimization and code generation through the Spark SQL Catalyst optimizer • Decorator around RDD • Previously SchemaRDD
  48. Write Less Code: Input & Output • Unified interface to reading/writing data in a variety of formats • Spark Notebook Example
  49. Configuring Kafka for Scaling
  50. Key to Scaling - Configuring Kafka Topics • Number of Partitions per Topic: Degree of parallelism • Directly Affects Spark Streaming Parallelism • bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 5 --topic ratings
  51. Populating Kafka Topics
       val record = new ProducerRecord[String, String](feederExtension.kafkaTopic, partNum, key, nxtRating.toString)
       val future = feederExtension.producer.send(record, new Callback {
  52. [Figure]
  53. Demo: Twitter Streaming Language Classifier • Twitter API • Streaming: collect tweets • HDFS: dataset • Spark SQL: ETL, queries • Spark: featurize • MLlib: train classifier • HDFS: model • Streaming: score tweets • language filter • Cassandra
  54. Demo: Twitter Streaming Language Classifier • From tweets to ML features, approximated as sparse vectors:
       1. extract text from the tweet: https://twitter.com/andy_bf/status/ → "Ceci n'est pas un tweet"
       2. sequence text as bigrams: tweet.sliding(2).toSeq → ("Ce", "ec", "ci", …)
       3. convert bigrams into numbers: seq.map(_.hashCode()) → (2178, 3230, 3174, …)
       4. index into sparse tf vector: seq.map(_.hashCode() % 1000) → (178, 230, 174, …)
       5. increment feature count: Vector.sparse(1000, …) → (1000, [102, 104, …], [0.0455, 0.0455, …
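The five featurization steps can be sketched in plain Scala without MLlib. The names `Featurize`, `bigrams` and `featurize` are hypothetical, and the result is a list of (index, weight) pairs rather than an MLlib sparse vector. The sketch also uses `Math.floorMod` where the slide uses `%`, because in general a string's hash code can be negative.

```scala
object Featurize {
  val NumFeatures = 1000

  /** Step 2: character bigrams, e.g. "Ceci" -> Vector("Ce", "ec", "ci"). */
  def bigrams(text: String): Vector[String] = text.sliding(2).toVector

  /**
   * Steps 3 to 5: hash each bigram, fold the hash into [0, NumFeatures), and count
   * occurrences. Returns (index, weight) pairs sorted by index, where each weight is
   * the bigram count divided by the total number of bigrams.
   */
  def featurize(text: String): Vector[(Int, Double)] = {
    val indices = bigrams(text).map(b => Math.floorMod(b.hashCode, NumFeatures))
    val total   = indices.size.toDouble
    indices
      .groupBy(identity)                                  // index -> all occurrences of it
      .map { case (idx, occ) => idx -> occ.size / total } // normalized term frequency
      .toVector
      .sortBy(_._1)
  }
}
```

For the slide's example tweet, which has 22 bigrams, a bigram that occurs once gets a weight of 1/22 ≈ 0.0455, matching the values shown in step 5.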
  55. KMeans: Formal Definition (ignore this)
  56. KMeans: How it really works…
  57. KMeans: How it really works…
  58. Demo: Twitter Streaming Language Classifier • Sample Code + Output: gist.github.com/ceteri/835565935da932cb59a2
       val sc = new SparkContext(new SparkConf())
       val ssc = new StreamingContext(sc, Seconds(5))
       val tweets = TwitterUtils.createStream(ssc, Utils.getAuth)
       val statuses = tweets.map(_.getText)
       val model = new KMeansModel(ssc.sparkContext.objectFile[Vector](modelFile.toString).collect())
       val filteredTweets = statuses
         .filter(t => model.predict(Utils.featurize(t)) == clust)
       filteredTweets.print()
       ssc.start()
       ssc.awaitTermination()
       CLUSTER 1: TLあんまり見ないけど @くれたっら いつでもくっるよ٩(δωδ)۶ そういえばディスガイアも今日か
       CLUSTER 4: ‫صدام‬ ‫بعد‬ ‫روحت‬ ‫العروبه‬ ‫قالوا‬ ‫العروبه‬ ‫تحيى‬ ‫سلمان‬ ‫مع‬ ‫واقول‬ RT @vip588: √ ‫مي‬ ‫فولو‬ √ ‫متابعني‬ ‫زيادة‬ √ ‫االن‬ ‫للمتواجدين‬ vip588 √ ‫ما‬ ‫يلتزم‬ ‫ما‬ ‫اللي‬ √ ‫رتويت‬ ‫عمل‬ ‫للي‬ ‫فولو‬ √ ‫للتغريدة‬ ‫رتويت‬ √ ‫باك‬ ‫فولو‬ ‫بيستفيد‬ … ‫سورة‬ ‫ن‬
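The filter in the sample code depends on `model.predict`, which assigns a feature vector to its closest cluster center. A plain-Scala sketch of that nearest-center rule follows; `NearestCluster` and its members are hypothetical names, and dense `Vector[Double]` points are assumed in place of MLlib vectors.

```scala
object NearestCluster {
  type Point = Vector[Double]

  /** Squared Euclidean distance between two points of the same dimension. */
  def squaredDistance(a: Point, b: Point): Double = {
    require(a.length == b.length, "points must have the same dimension")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  /** Index of the cluster center closest to `p`. */
  def predict(centers: Vector[Point], p: Point): Int = {
    require(centers.nonEmpty, "at least one center is required")
    centers.indices.minBy(i => squaredDistance(centers(i), p))
  }
}
```

Keeping only the tweets whose predicted index equals a chosen cluster number is what produces per-language output like the CLUSTER 1 and CLUSTER 4 samples.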
  59. Thank you
