1. Spark Streaming & Spark SQL
Yousun Jeong
jerryjung@sk.com
2. History - Spark
Developed in 2009 at UC Berkeley AMPLab, then open sourced in 2010, Spark has since become one of the largest OSS communities in big data, with over 200 contributors in 50+ organizations.
“Organizations that are looking at big data challenges – including collection, ETL, storage, exploration and analytics – should consider Spark for its in-memory performance and the breadth of its model. It supports advanced analytics solutions on Hadoop clusters, including the iterative model required for machine learning and graph analysis.” – Gartner, Advanced Analytics and Data Science (2014)
3. History - Spark
Some key points about Spark:
• handles batch, interactive, and real-time within a single framework
• native integration with Java, Python, Scala
• programming at a higher level of abstraction
• multi-step Directed Acyclic Graphs (DAGs): many stages, compared to just Map and Reduce in Hadoop
4. Data Sharing in MR
http://www.slideshare.net/jamesskillsmatter/zaharia-sparkscaladays2012
5. Spark
6. Benchmark Test
databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
7. RDD
Resilient Distributed Datasets (RDD) are the primary abstraction in Spark – a fault-tolerant collection of elements that can be operated on in parallel. There are currently two types:
• parallelized collections – take an existing Scala collection and run functions on it in parallel
• Hadoop datasets – run functions on each record of a file in HDFS or any other storage system supported by Hadoop
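A minimal Scala sketch of these two creation paths, assuming an existing SparkContext named sc (the collection and the HDFS path are illustrative):

// parallelized collection: distribute an existing Scala collection
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
val doubled = nums.map(_ * 2) // runs in parallel across partitions

// Hadoop dataset: run a function on each record of a file in HDFS
val lines = sc.textFile("hdfs:///data/input.txt") // placeholder path
val lineLengths = lines.map(_.length)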
8. Fault Tolerance
• An RDD is an immutable, deterministically re-computable, distributed dataset.
• An RDD tracks its lineage info, which is used to rebuild lost data.
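The lineage an RDD tracks can be inspected directly; a small sketch, reusing the nums RDD from the sketch above:

// toDebugString prints the chain of transformations (the lineage)
// Spark would replay to recompute lost partitions
val counts = nums.map(n => (n % 2, 1)).reduceByKey(_ + _)
println(counts.toDebugString)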
9. Benefit of Spark
Spark helps us gain processing speed and implement various big data applications easily and quickly:
▪ Support for Event Stream Processing
▪ Fast Data Queries in Real Time
▪ Improved Programmer Productivity
▪ Fast Batch Processing of Large Data Sets
Why I use Spark …
10. Big Data
Big Data is not just “big”
The 3Vs of Big Data: Volume, Velocity, Variety
11. Big Data Processing
1. Batch Processing
• processing data en masse
• big & complex
• higher latencies, e.g. MapReduce
2. Stream Processing
• one-at-a-time processing
• computations are relatively simple and generally independent
• sub-second latency, e.g. Storm
3. Micro-Batching
• small batch size (batch + streaming)
12. Spark Streaming Integration
13. Spark Streaming In Action
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

// create a StreamingContext with a SparkConf configuration
// (the app name is illustrative)
val sparkConf = new SparkConf().setAppName("NetworkWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(10))

// create a DStream that will connect to serverIP:serverPort
val lines = ssc.socketTextStream(serverIP, serverPort)

// split each line into words
val words = lines.flatMap(_.split(" "))

// count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// print a few of the counts to the console
wordCounts.print()

ssc.start()            // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
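To try this sketch locally, the Spark documentation's NetworkWordCount example pairs it with netcat: run nc -lk 9999 in one terminal, point serverIP/serverPort at localhost:9999, and type words into the netcat session while the job runs; counts print every 10-second batch.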
14. Spark UI
15. Spark SQL
16. Spark SQL In Action
// Data can easily be extracted from existing sources,
// such as Apache Hive.
val trainingDataTable = sql("""
  SELECT e.action, u.age, u.latitude, u.longitude
  FROM Users u
  JOIN Events e ON u.userId = e.userId""")

// Since `sql` returns an RDD, the results of the above
// query can be easily used in MLlib.
val trainingData = trainingDataTable.map { row =>
  val features = Array[Double](row(1), row(2), row(3))
  LabeledPoint(row(0), features)
}
val model = new LogisticRegressionWithSGD().run(trainingData)
17. Spark SQL In Action
val allCandidates = sql("""
  SELECT userId, age, latitude, longitude
  FROM Users
  WHERE subscribed = FALSE""")

// Results of ML algorithms can be used as tables
// in subsequent SQL statements.
case class Score(userId: Int, score: Double)
val scores = allCandidates.map { row =>
  val features = Array[Double](row(1), row(2), row(3))
  Score(row(0), model.predict(features))
}
scores.registerAsTable("Scores")
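Once registered, the scored results can be queried like any other table; a small follow-on sketch (the score threshold is illustrative):

// query the ML output with plain SQL
val topCandidates = sql("""
  SELECT userId, score
  FROM Scores
  WHERE score > 0.5""")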
18. MR vs RDD - Compute an Average
19. RDD vs DF - Compute an Average
Using RDDs
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1]))
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]])
    .map(lambda x: [x[0], x[1][0] / x[1][1]])
    .collect()
Using DataFrames
sqlCtx.table("people").groupBy("name").agg("name", avg("age")).collect()
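The DataFrame version is not just shorter: because the aggregation is declarative, Spark's Catalyst optimizer can plan and optimize the physical execution, whereas the RDD version runs opaque Python lambdas that Spark cannot inspect. That is generally why the DataFrame API is both more concise and faster for this kind of query.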
20. Spark 2.0: Structured Streaming
• Structured Streaming
  • High-level streaming API built on the Spark SQL engine (a minimal sketch follows below)
  • Runs the same queries on DataFrames
  • Event time, windowing, sessions, sources & sinks
• Unifies streaming, interactive and batch queries
  • Aggregate data in a stream, then serve using JDBC
  • Change queries at runtime
  • Build and apply ML models
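As a concrete illustration, here is a minimal Structured Streaming word count in the API as it shipped in Spark 2.x (the socket source, host, and port are illustrative); the deck's own examples below use the earlier pre-release syntax:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()
import spark.implicits._

// read a stream of lines from a socket (host/port are placeholders)
val lines = spark.readStream.format("socket")
  .option("host", "localhost").option("port", 9999).load()

// the same DataFrame operations you would run on a batch query
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()

// continuously print updated counts to the console
val query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()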
21. Spark 2.0 Example: Page View Count
Input: records in Kafka
Query: select count(*) group by page, minute(evtime)
Trigger: “every 5 sec”
Output mode: “update-in-place”, into MySQL sink

logs = ctx.read.format("json").stream("s3://logs")
logs.groupBy(logs.user_id)
    .agg(sum(logs.time))
    .write.format("jdbc")
    .stream("jdbc:mysql://...")
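Note: this snippet uses the pre-release Spark 2.0 preview syntax; in the released API, read…stream(…) became spark.readStream…load(…) and write…stream(…) became df.writeStream…start(…), as in the sketch above.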
22. Spark 2.0 Use Case: Fraud Detection
23. Spark 2.0 Performance
24. Q & A
25. Thank You!