
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)


This is the first part of a Spark 2 new-features overview.
It covers API changes, Structured Streaming, Encoders, memory management in Spark, Tungsten issues, and Catalyst features.



  1. 1. Spark 2 Alexey Zinovyev, Java/Big Data Trainer at EPAM
  2. 2. About: In IT since 2007, with Java since 2009, with Hadoop since 2012, with EPAM since 2015
  3. 3. 3Spark 2 from Zinoviev Alexey Contacts E-mail : Alexey_Zinovyev@epam.com Twitter : @zaleslaw @BigDataRussia Facebook: https://www.facebook.com/zaleslaw vk.com/big_data_russia Big Data Russia vk.com/java_jvm Java & JVM langs
  4. 4. 4Spark 2 from Zinoviev Alexey Sprk Dvlprs! Let’s start!
  5. 5. 5Spark 2 from Zinoviev Alexey < SPARK 2.0
  6. 6. 6Spark 2 from Zinoviev Alexey Modern Java in 2016 / Big Data in 2014
  7. 7. 7Spark 2 from Zinoviev Alexey Big Data in 2017
  8. 8. 8Spark 2 from Zinoviev Alexey Machine Learning EVERYWHERE
  9. 9. 9Spark 2 from Zinoviev Alexey Machine Learning vs Traditional Programming
  10. 10. 10Spark 2 from Zinoviev Alexey
  11. 11. 11Spark 2 from Zinoviev Alexey Something wrong with HADOOP
  12. 12. 12Spark 2 from Zinoviev Alexey Hadoop is not SEXY
  13. 13. 13Spark 2 from Zinoviev Alexey Whaaaat?
  14. 14. 14Spark 2 from Zinoviev Alexey Map Reduce Job Writing
  15. 15. 15Spark 2 from Zinoviev Alexey MR code
  16. 16. 16Spark 2 from Zinoviev Alexey Hadoop Developers Right Now
  17. 17. 17Spark 2 from Zinoviev Alexey Iterative Calculations 10x – 100x
  18. 18. 18Spark 2 from Zinoviev Alexey MapReduce vs Spark
  21. 21. 21Spark 2 from Zinoviev Alexey SPARK INTRO
  22. 22. 22Spark 2 from Zinoviev Alexey Loading // from a local collection val localData = Seq(5, 7, 1, 12, 10, 25) val ourFirstRDD = sc.parallelize(localData) // from a file val textFile = sc.textFile("hdfs://...")
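For context, the snippets above assume an already-created SparkContext (sc), as in spark-shell. A minimal self-contained sketch (the app name, master URL and HDFS path are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("rdd-loading").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val ourFirstRDD = sc.parallelize(Seq(5, 7, 1, 12, 10, 25))        // from a local collection
    val textFile = sc.textFile("hdfs://namenode:8020/data/input.txt") // from a file (placeholder path)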
  27. 27. 27Spark 2 from Zinoviev Alexey Word Count val textFile = sc.textFile("hdfs://...") val counts = textFile .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
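For comparison with the Spark 2 APIs discussed later in the deck, the same word count written against SparkSession and the Dataset API might look like this (a sketch; the paths are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("word-count").master("local[*]").getOrCreate()
    import spark.implicits._

    val counts = spark.read.textFile("hdfs://namenode:8020/in")
      .flatMap(_.split(" "))
      .groupByKey(identity)
      .count()                                   // Dataset[(String, Long)]
    counts.write.csv("hdfs://namenode:8020/out")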
  30. 30. 30Spark 2 from Zinoviev Alexey SQL val hive = new HiveContext(sc) hive.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") hive.sql("LOAD DATA LOCAL INPATH '…/kv1.txt' INTO TABLE src") val results = hive.sql("FROM src SELECT key, value").collect()
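In Spark 2 the same flow goes through a SparkSession with Hive support enabled; a sketch (the input path is a placeholder):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hive-example").enableHiveSupport().getOrCreate()
    spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    spark.sql("LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE src")
    val results = spark.sql("SELECT key, value FROM src").collect()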
  34. 34. 34Spark 2 from Zinoviev Alexey RDD Factory
  35. 35. 35Spark 2 from Zinoviev Alexey DStream val conf = new SparkConf().setMaster("local[2]") .setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.socketTextStream("localhost", 9999) val words = lines.flatMap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination()
  40. 40. 40Spark 2 from Zinoviev Alexey It’s very easy to understand Graph algorithms def pregel[A](initialMsg: A, maxIterations: Int, activeDirection: EdgeDirection)( vprog: (VertexID, VD, A) => VD, sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexID,A)], mergeMsg: (A, A) => A) : Graph[VD, ED] = { <Your Pregel Computations> }
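For a concrete use of this signature, here is the single-source shortest paths example from the GraphX docs, assuming a graph: Graph[Double, Double] whose vertex attribute is the current best distance (0.0 at the source, Double.PositiveInfinity elsewhere):

    import org.apache.spark.graphx._

    val shortestPaths = graph.pregel(Double.PositiveInfinity)(
      (id, dist, newDist) => math.min(dist, newDist),                // vprog: keep the best distance seen so far
      triplet =>                                                     // sendMsg: propose a shorter path to the neighbour
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        else Iterator.empty,
      (a, b) => math.min(a, b)                                       // mergeMsg: keep the smaller of two proposals
    )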
  41. 41. 41Spark 2 from Zinoviev Alexey SPARK 2.0 DISCUSSION
  42. 42. 42Spark 2 from Zinoviev Alexey Spark Family
  44. 44. 44Spark 2 from Zinoviev Alexey Case #0 : How to avoid DStreams and their RDD-like API?
  45. 45. 45Spark 2 from Zinoviev Alexey Continuous Applications
  46. 46. 46Spark 2 from Zinoviev Alexey Continuous Applications cases • Updating data that will be served in real time • Extract, transform and load (ETL) • Creating a real-time version of an existing batch job • Online machine learning
  47. 47. 47Spark 2 from Zinoviev Alexey Write Batches
  48. 48. 48Spark 2 from Zinoviev Alexey Catch Streaming
  49. 49. 49Spark 2 from Zinoviev Alexey The main concept of Structured Streaming You can express your streaming computation the same way you would express a batch computation on static data.
  50. 50. 50Spark 2 from Zinoviev Alexey Batch // Read JSON once from S3 val logsDF = spark.read.json("s3://logs") // Transform with DataFrame API and save logsDF.select("user", "url", "date") .write.parquet("s3://out")
  51. 51. 51Spark 2 from Zinoviev Alexey Real Time // Read JSON continuously from S3 val logsDF = spark.readStream.json("s3://logs") // Transform with DataFrame API and save (file sinks also need a checkpointLocation) logsDF.select("user", "url", "date") .writeStream.format("parquet") .option("checkpointLocation", "s3://checkpoints") .start("s3://out")
  52. 52. 52Spark 2 from Zinoviev Alexey Unbounded Table
  53. 53. 53Spark 2 from Zinoviev Alexey WordCount from Socket val lines = spark.readStream .format("socket") .option("host", "localhost") .option("port", 9999) .load() val words = lines.as[String].flatMap(_.split(" ")) val wordCounts = words.groupBy("value").count() Don't forget to start the streaming query
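The missing "start" step looks like this in the Spark documentation's quick example (console sink used for illustration):

    val query = wordCounts.writeStream
      .outputMode("complete")   // keep the full running counts in the result table
      .format("console")
      .start()
    query.awaitTermination()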
  57. 57. 57Spark 2 from Zinoviev Alexey WordCount with Structured Streaming
  58. 58. 58Spark 2 from Zinoviev Alexey Structured Streaming provides … • fast & scalable • fault-tolerant • end-to-end exactly-once semantics • stream processing • the ability to use the DataFrame/Dataset API for streaming
  59. 59. 59Spark 2 from Zinoviev Alexey Structured Streaming provides (in dreams) … • fast & scalable • fault-tolerant • end-to-end exactly-once semantics • stream processing • the ability to use the DataFrame/Dataset API for streaming
  60. 60. 60Spark 2 from Zinoviev Alexey Let’s UNION streaming and static DataSets
  61. 61. 61Spark 2 from Zinoviev Alexey Let’s UNION streaming and static DataSets org.apache.spark.sql. AnalysisException: Union between streaming and batch DataFrames/Datasets is not supported;
  62. 62. 62Spark 2 from Zinoviev Alexey Let’s UNION streaming and static DataSets Go to UnsupportedOperationChecker.scala and check your operation
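Reproducing the error from slide 61 takes only a couple of lines (a sketch; spark is an existing SparkSession and the path is a placeholder):

    val staticDF = spark.read.json("s3://logs")        // batch DataFrame
    val streamDF = spark.readStream.json("s3://logs")  // streaming DataFrame
    streamDF.union(staticDF)  // AnalysisException: Union between streaming and batch DataFrames/Datasets is not supported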
  63. 63. 63Spark 2 from Zinoviev Alexey Case #1 : We should think about optimization in RDD terms
  64. 64. 64Spark 2 from Zinoviev Alexey Single Thread collection
  65. 65. 65Spark 2 from Zinoviev Alexey No perf issues, right?
  66. 66. 66Spark 2 from Zinoviev Alexey The main concept more partitions = more parallelism
  67. 67. 67Spark 2 from Zinoviev Alexey Do it parallel
  68. 68. 68Spark 2 from Zinoviev Alexey I’d like NARROW
  69. 69. 69Spark 2 from Zinoviev Alexey Map, filter, filter
  70. 70. 70Spark 2 from Zinoviev Alexey GroupByKey, join
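To make the narrow vs. wide distinction of the last few slides concrete (a sketch; the input RDDs are placeholders):

    val numbers = sc.parallelize(1 to 1000000)
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // Narrow transformations: each output partition depends on one input partition, no shuffle
    val narrow = numbers.map(_ * 2).filter(_ > 10)

    // Wide transformations: data is shuffled across the cluster
    val grouped = pairs.groupByKey()         // moves every value over the network
    val reduced = pairs.reduceByKey(_ + _)   // pre-aggregates on the map side, usually preferable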
  71. 71. 71Spark 2 from Zinoviev Alexey Case #2 : DataFrames let you mix SQL and Scala functions
  72. 72. 72Spark 2 from Zinoviev Alexey History of Spark APIs
  73. 73. 73Spark 2 from Zinoviev Alexey RDD rdd.filter(_.age > 21) // RDD df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age > 21) // Dataset API
  74. 74. 74Spark 2 from Zinoviev Alexey History of Spark APIs
  75. 75. 75Spark 2 from Zinoviev Alexey SQL rdd.filter(_.age > 21) // RDD df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age > 21) // Dataset API
  76. 76. 76Spark 2 from Zinoviev Alexey Expression rdd.filter(_.age > 21) // RDD df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age > 21) // Dataset API
  77. 77. 77Spark 2 from Zinoviev Alexey History of Spark APIs
  78. 78. 78Spark 2 from Zinoviev Alexey DataSet rdd.filter(_.age > 21) // RDD df.filter("age > 21") // DataFrame SQL-style df.filter(df.col("age").gt(21)) // Expression style dataset.filter(_.age > 21) // Dataset API
  79. 79. 79Spark 2 from Zinoviev Alexey Case #2 : DataFrames refer to data attributes by name
  80. 80. 80Spark 2 from Zinoviev Alexey DataSet = RDD’s types + DataFrame’s Catalyst • RDD API • compile-time type-safety • off-heap storage mechanism • performance benefits of the Catalyst query optimizer • Tungsten
  82. 82. 82Spark 2 from Zinoviev Alexey Structured APIs in SPARK
  83. 83. 83Spark 2 from Zinoviev Alexey Unified API in Spark 2.0 DataFrame = Dataset[Row] A DataFrame is now simply an untyped (Row-based) Dataset
  84. 84. 84Spark 2 from Zinoviev Alexey Define case class case class User(email: String, footSize: Long, name: String) // DataFrame -> DataSet with Users val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User] userDS.map(_.name).collect() userDS.filter(_.footSize > 38).collect() userDS.rdd // IF YOU REALLY WANT
  85. 85. 85Spark 2 from Zinoviev Alexey Read JSON case class User(email: String, footSize: Long, name: String) // DataFrame -> DataSet with Users val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User] userDS.map(_.name).collect() userDS.filter(_.footSize > 38).collect() userDS.rdd // IF YOU REALLY WANT
  86. 86. 86Spark 2 from Zinoviev Alexey Filter by Field case class User(email: String, footSize: Long, name: String) // DataFrame -> DataSet with Users val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User] userDS.map(_.name).collect() userDS.filter(_.footSize > 38).collect() userDS.rdd // IF YOU REALLY WANT
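Put together as a spark-shell style sketch (same JSON path as on the slide; the implicits import is what supplies the User encoder):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("dataset-example").master("local[*]").getOrCreate()
    import spark.implicits._

    case class User(email: String, footSize: Long, name: String)

    val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User]  // DataFrame -> Dataset[User]
    userDS.map(_.name).collect()
    userDS.filter(_.footSize > 38).collect()
    val userRDD = userDS.rdd                                                // if you really want an RDD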
  87. 87. 87Spark 2 from Zinoviev Alexey Case #3 : Spark has many contexts
  88. 88. 88Spark 2 from Zinoviev Alexey Spark Session • New entry point in Spark for creating Datasets • Replaces SQLContext and HiveContext (and Structured Streaming takes over the role of StreamingContext) • The move from SparkContext to SparkSession signals the move away from raw RDDs
  89. 89. 89Spark 2 from Zinoviev Alexey Spark Session val sparkSession = SparkSession.builder .master("local") .appName("spark session example") .getOrCreate() val df = sparkSession.read .option("header","true") .csv("src/main/resources/names.csv") df.show()
  90. 90. 90Spark 2 from Zinoviev Alexey No, I want to create my lovely RDDs
  91. 91. 91Spark 2 from Zinoviev Alexey Where’s parallelize() method?
  92. 92. 92Spark 2 from Zinoviev Alexey RDD? case class User(email: String, footSize: Long, name: String) // DataFrame -> DataSet with Users val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User] userDS.map(_.name).collect() userDS.filter(_.footSize > 38).collect() userDS.rdd // IF YOU REALLY WANT
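The old entry point has not disappeared: the SparkSession still wraps a SparkContext, so parallelize() is one hop away (a sketch, reusing userDS from the slide above and assuming a SparkSession named spark):

    val sc = spark.sparkContext                               // the underlying SparkContext
    val lovelyRDD = sc.parallelize(Seq(5, 7, 1, 12, 10, 25))
    val usersAsRDD = userDS.rdd                               // or convert a Dataset back to an RDD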
  93. 93. 93Spark 2 from Zinoviev Alexey Case #4 : Spark uses Java serialization A LOT
  94. 94. 94Spark 2 from Zinoviev Alexey Two choices to distribute data across the cluster • Java serialization: the default, via ObjectOutputStream • Kryo serialization: faster and more compact, but you should register your classes up front (it does not rely on java.io.Serializable)
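Switching to Kryo is a configuration change (a sketch; register whichever classes you actually shuffle or cache):

    import org.apache.spark.SparkConf

    case class User(email: String, footSize: Long, name: String)   // the class from the earlier slides

    val conf = new SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[User]))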
  95. 95. 95Spark 2 from Zinoviev Alexey The main problem is serialization overhead: each serialized object carries the class structure as well as the values. And don't forget about the GC pressure this creates
  97. 97. 97Spark 2 from Zinoviev Alexey Tungsten Compact Encoding
  98. 98. 98Spark 2 from Zinoviev Alexey The Encoder concept: generate bytecode to interact with off-heap data and give access to individual attributes without deserializing the whole object
  99. 99. 99Spark 2 from Zinoviev Alexey Encoders
  100. 100. 100Spark 2 from Zinoviev Alexey No custom encoders
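Concretely, Spark 2 ships encoders for primitives, tuples and case classes (via spark.implicits._); for everything else the only escape hatch is an opaque binary encoder, so there are no true custom encoders yet. A spark-shell style sketch, assuming a SparkSession named spark:

    import org.apache.spark.sql.{Encoder, Encoders}
    import spark.implicits._                               // built-in encoders

    case class User(email: String, footSize: Long, name: String)
    val users = Seq(User("a@b.c", 42L, "Alex")).toDS()     // uses the generated case-class encoder

    class LegacyPoint(val x: Int, val y: Int)              // a plain class with no built-in encoder
    implicit val legacyEnc: Encoder[LegacyPoint] = Encoders.kryo[LegacyPoint]
    val points = Seq(new LegacyPoint(1, 2)).toDS()         // stored as a single binary column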
  101. 101. 101Spark 2 from Zinoviev Alexey Case #5 : Not enough storage levels
  102. 102. 102Spark 2 from Zinoviev Alexey Caching in Spark • Frequently used RDDs can be stored in memory • One method plus a shortcut: persist() and cache() • The SparkContext keeps track of cached RDDs • Stored as serialized or deserialized Java objects
  103. 103. 103Spark 2 from Zinoviev Alexey Full list of options • MEMORY_ONLY • MEMORY_AND_DISK • MEMORY_ONLY_SER • MEMORY_AND_DISK_SER • DISK_ONLY • MEMORY_ONLY_2, MEMORY_AND_DISK_2
  104. 104. 104Spark 2 from Zinoviev Alexey Spark Core Storage Level • MEMORY_ONLY (default for Spark Core) • MEMORY_AND_DISK • MEMORY_ONLY_SER • MEMORY_AND_DISK_SER • DISK_ONLY • MEMORY_ONLY_2, MEMORY_AND_DISK_2
  105. 105. 105Spark 2 from Zinoviev Alexey Spark Streaming Storage Level • MEMORY_ONLY (default for Spark Core) • MEMORY_AND_DISK • MEMORY_ONLY_SER (default for Spark Streaming) • MEMORY_AND_DISK_SER • DISK_ONLY • MEMORY_ONLY_2, MEMORY_AND_DISK_2
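Picking a level is a one-liner on any RDD (a sketch; myRDD is a placeholder):

    import org.apache.spark.storage.StorageLevel

    myRDD.persist(StorageLevel.MEMORY_AND_DISK_SER)   // pick an explicit storage level...
    // myRDD.cache()                                   // ...or use the shortcut for persist(MEMORY_ONLY)
    myRDD.unpersist()                                  // release it when you are done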
  106. 106. 106Spark 2 from Zinoviev Alexey Developer API to make new Storage Levels
  107. 107. 107Spark 2 from Zinoviev Alexey What’s the most popular file format in BigData?
  108. 108. 108Spark 2 from Zinoviev Alexey Case #6 : External libraries to read CSV
  109. 109. 109Spark 2 from Zinoviev Alexey Easy to read CSV val data = spark.read.format("csv") .option("header", "true") .option("inferSchema", "true") .load("/datasets/samples/users.csv") data.cache() data.createOrReplaceTempView("users") display(data) // display() is a Databricks notebook helper
  110. 110. 110Spark 2 from Zinoviev Alexey Case #7 : How to measure Spark performance?
  111. 111. 111Spark 2 from Zinoviev Alexey You should measure performance!
  112. 112. 112Spark 2 from Zinoviev Alexey TPC-DS 99 Queries http://bit.ly/2dObMsH
  113. 113. 113Spark 2 from Zinoviev Alexey
  114. 114. 114Spark 2 from Zinoviev Alexey How to benchmark Spark
  115. 115. 115Spark 2 from Zinoviev Alexey Special Tool from Databricks Benchmark Tool for SparkSQL https://github.com/databricks/spark-sql-perf
  116. 116. 116Spark 2 from Zinoviev Alexey Spark 2 vs Spark 1.6
  117. 117. 117Spark 2 from Zinoviev Alexey Case #8 : What’s faster: SQL or DataSet API?
  118. 118. 118Spark 2 from Zinoviev Alexey Job Stages in old Spark
  119. 119. 119Spark 2 from Zinoviev Alexey Scheduler Optimizations
  120. 120. 120Spark 2 from Zinoviev Alexey Catalyst Optimizer for DataFrames
  121. 121. 121Spark 2 from Zinoviev Alexey Unified Logical Plan DataFrame = Dataset[Row]
  122. 122. 122Spark 2 from Zinoviev Alexey Bytecode
  123. 123. 123Spark 2 from Zinoviev Alexey DataSet.explain() == Physical Plan == Project [avg(price)#43,carat#45] +- SortMergeJoin [color#21], [color#47] :- Sort [color#21 ASC], false, 0 : +- TungstenExchange hashpartitioning(color#21,200), None : +- Project [avg(price)#43,color#21] : +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43]) : +- TungstenExchange hashpartitioning(cut#20,color#21,200), None : +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)], output=[cut#20,color#21,sum#58,count#59L]) : +- Scan CsvRelation(-----) +- Sort [color#47 ASC], false, 0 +- TungstenExchange hashpartitioning(color#47,200), None +- ConvertToUnsafe +- Scan CsvRelation(----)
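The plan above corresponds to a query roughly like the following on the diamonds CSV (a reconstruction for illustration, not the exact code from the talk; the path is a placeholder):

    import org.apache.spark.sql.functions.avg

    val diamonds = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/datasets/diamonds.csv")

    val avgPrice = diamonds.groupBy("cut", "color").agg(avg("price"))
    avgPrice.join(diamonds.select("color", "carat"), "color")
      .select("avg(price)", "carat")
      .explain()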
  124. 124. 124Spark 2 from Zinoviev Alexey Case #9 : Why does explain() show so many Tungsten things?
  125. 125. 125Spark 2 from Zinoviev Alexey How to be effective with CPU • Runtime code generation • Exploiting cache locality • Off-heap memory management
  126. 126. 126Spark 2 from Zinoviev Alexey Tungsten’s goal Push performance closer to the limits of modern hardware
  127. 127. 127Spark 2 from Zinoviev Alexey Maybe something UNSAFE?
  128. 128. 128Spark 2 from Zinoviev Alexey UnsafeRow format • A bit set for tracking null values • Small values are inlined • Variable-length values store a relative offset into the variable-length data section • Rows are always 8-byte word aligned • Equality comparison and hashing can be performed on the raw bytes without additional interpretation
  129. 129. 129Spark 2 from Zinoviev Alexey Case #10 : Can I influence memory management in Spark?
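To a degree, yes: the unified memory manager exposes a few knobs (a sketch; the values shown are the Spark 2.0 defaults or purely illustrative, not recommendations):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.memory.fraction", "0.6")          // share of the heap for execution + storage
      .set("spark.memory.storageFraction", "0.5")   // portion of that protected from eviction for cached data
      .set("spark.memory.offHeap.enabled", "true")  // let Tungsten allocate off-heap
      .set("spark.memory.offHeap.size", "1g")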
  130. 130. 130Spark 2 from Zinoviev Alexey Case #11 : Should I tune the GC generations?
  131. 131. 131Spark 2 from Zinoviev Alexey Cached Data
  132. 132. 132Spark 2 from Zinoviev Alexey During operations
  133. 133. 133Spark 2 from Zinoviev Alexey For your needs
  134. 134. 134Spark 2 from Zinoviev Alexey For Dark Lord
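If you do end up tuning generations, in Spark that is mostly ordinary JVM tuning passed through the executor options (an illustrative sketch, not a recommendation):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "4g")
      .set("spark.executor.extraJavaOptions",
        "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")  // or tune young/old generation sizes instead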
  135. 135. 135Spark 2 from Zinoviev Alexey IN CONCLUSION
  136. 136. 136Spark 2 from Zinoviev Alexey We still lack… • the ability to join structured streams with other sources • one unified ML API • a GraphX rethink and redesign • custom encoders • Datasets everywhere • integration with something important
  137. 137. 137Spark 2 from Zinoviev Alexey Roadmap • Support for other data sources (not only S3 + HDFS) • Transactional updates • Dataset as the single DSL for all operations • GraphFrames + Structured MLlib • Tungsten: custom encoders • The RDD-based MLlib API is expected to be removed in Spark 3.0
  138. 138. 138Spark 2 from Zinoviev Alexey And we can DO IT! (architecture diagram: real-time and batch data ingestion from internal, external and social sources; real-time ETL & CEP plus batch ETL with a raw area where all raw data backup is stored in HDFS, or CFS as an option; time-series data in Titan & KairosDB on top of Cassandra; real-time and batch data marts; relations graph, ontology and metadata; search index; scheduler; events & alarms pushed to real-time dashboards and via email, SNMP etc.)
  139. 139. 139Spark 2 from Zinoviev Alexey First Part
  140. 140. 140Spark 2 from Zinoviev Alexey Second Part
  141. 141. 141Spark 2 from Zinoviev Alexey Contacts E-mail : Alexey_Zinovyev@epam.com Twitter : @zaleslaw @BigDataRussia Facebook: https://www.facebook.com/zaleslaw vk.com/big_data_russia Big Data Russia vk.com/java_jvm Java & JVM langs
  142. 142. 142Spark 2 from Zinoviev Alexey Any questions?
