Bartosz.Jankiewicz@gmail.com, Scalapolis 2016
Make yourself a scalable pipeline with Apache Spark
Google Dataflow, 2014
The future of data processing is unbounded data. Though bounded data will always have an important and useful place, it is semantically subsumed by its unbounded counterpart.
Jaikumar Vijayan, eWeek 2015
Analyst firms like Forrester expect demand for streaming
analytics services and technologies to grow in the next
few years as more organisations try to extract value from
the huge volumes of data being generated these days
from transactions, Web clickstreams, mobile applications
and cloud services.
❖ Integrate user activity information
❖ Enable nearly real-time analytics
❖ Scale to millions of visits per day
❖ Respond to rapidly emerging requirements
❖ Enable data-science techniques on top of collected data
❖ Do the above with reasonable cost
Canonical architecture
(Diagram: sources — web, sensor, audit-event, micro-service — feeding an ingestion layer.)
Apache Spark
❖ Started in 2009
❖ Developed in Scala with Akka
❖ Polyglot: Currently supports Scala, Java, Python and R
❖ The largest BigData community as of 2015
Spark use-cases
❖ Data integration and ETL
❖ Interactive analytics
❖ Machine learning and advanced analytics
Apache Spark
❖ Scalable
❖ Fast
❖ Elegant programming model
❖ Fault tolerant
Scalable
• Scalable by design
• Scales to hundreds of nodes
• Proven in production by many companies
Fast
• You can optimise both for latency and throughput
• Reduced hardware appetite due to various optimisations
• Further improvements added with Structured Streaming in Spark 2.0
Programming model
• Functional paradigm
• Easy to run, easy to test
• Polyglot (R, Scala, Python, Java)
• Batch and streaming APIs are very similar
• REPL - a.k.a. Spark shell
Fault tolerance
• Data is distributed and replicated
• Seamlessly recovers from node failure
• Zero-data-loss guarantees thanks to the write-ahead log (see the sketch below)
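A minimal sketch of how the write-ahead log could be switched on (the configuration key is the one documented for receiver-based streams; the application name and checkpoint path are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Persist received data to the write-ahead log before it is processed.
val conf = new SparkConf()
  .setAppName("wal-example")                                   // placeholder
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///tmp/checkpoints")  // placeholder path; the WAL needs a checkpoint directory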
Runtime model
(Diagram: a Driver Program running your code and the SparkContext coordinates Executors #1–#4, each holding data partitions p1–p6.)
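A minimal sketch of the driver side of this picture (application name and master URL are placeholders): the SparkContext created in the driver program is what schedules work on the executors.

import org.apache.spark.{SparkConf, SparkContext}

// The driver program builds a SparkConf and a SparkContext;
// the context distributes tasks to the executors and their partitions.
val conf = new SparkConf()
  .setAppName("scalable-pipeline")   // placeholder application name
  .setMaster("spark://master:7077")  // placeholder cluster URL

val sc = new SparkContext(conf)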
RDD - Resilient Distributed Dataset
(Diagram: the Driver Program issues val textFile = sc.textFile("hdfs://…"); Executors #1–#4 read partitions from Data nodes #1–#4.)
val rdd: RDD[String] =
  sc.textFile(…)

val wordsRDD = rdd
  .flatMap(line => line.split(" "))

val lengthHistogram = wordsRDD
  .groupBy(word => word.length)
  .collect

val aWords = wordsRDD
  .filter(word => word.startsWith("a"))

aWords.saveAsTextFile("hdfs://…")
Meet the DAG
(Diagram: RDDs A–F arranged into a directed acyclic graph of dependencies that Spark schedules for execution.)
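One way to inspect the lineage/DAG Spark has built for the snippet above is the standard toDebugString method on an RDD:

// Prints the chain of parent RDDs (the DAG) that Spark will execute for wordsRDD.
println(wordsRDD.toDebugString)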
DStream
❖ Series of small and deterministic batch jobs
❖ Spark chops live stream into batches
❖ Each micro-batch processing produces a result
(Timeline: seconds 1–6 of the live stream map to micro-batches RDD1–RDD6.)
Streaming program
val dstream: DStream[String] = …

val wordsStream = dstream
  .flatMap(line => line.split(" "))
  .transform(_.map(_.toUpperCase))
  .countByValue()

wordsStream.print()
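A minimal sketch of the setup the snippet assumes (the batch interval and the socket source are illustrative choices, not necessarily the author's):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

val conf = new SparkConf().setAppName("streaming-example").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches

// One possible source for the DStream used above: a plain TCP socket.
val dstream: DStream[String] = ssc.socketTextStream("localhost", 9999)

// ... transformations as on the slide ...

ssc.start()
ssc.awaitTermination()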
It’s not a free lunch
❖ The abstractions are leaky
❖ You need to control the level of parallelism
❖ You need to understand the impact of each transformation
❖ Don’t materialise whole partitions inside the foreachPartition operation (see the sketch below)
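To illustrate the last bullet, a hedged sketch: inside foreachPartition the partition arrives as an iterator, and calling something like toList on it pulls the whole partition into executor memory at once. send and createConnection below are hypothetical sink helpers.

// Risky: materialises the whole partition in memory.
rdd.foreachPartition { partition =>
  val all = partition.toList            // avoid this on large partitions
  all.foreach(record => send(record))   // hypothetical sink
}

// Safer: stream through the iterator, one connection per partition.
rdd.foreachPartition { partition =>
  val connection = createConnection()   // hypothetical connection factory
  partition.foreach(record => connection.send(record))
  connection.close()
}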
Performance factors
• Network operations
• Data locality
• Total number of cores
• How much you can chunk your work
• Memory usage and GC
• Serialization
Level of parallelism
❖ Number of receivers aligned with the number of executors
❖ Number of threads aligned with the number of cores and the nature of operations - blocking or non-blocking
❖ Your data needs to be chunked to make use of your hardware (see the sketch below)
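Two common knobs, shown as a sketch (receiver and partition counts are illustrative; ssc is the StreamingContext from the earlier sketch): start several receivers and union their streams, then repartition before the heavy transformations.

val numReceivers = 4   // illustrative: often one receiver per executor
val streams = (1 to numReceivers).map(_ => ssc.socketTextStream("source-host", 9999))
val unioned = ssc.union(streams)

// Spread the work over the available cores before expensive stages.
val repartitioned = unioned.repartition(16)   // illustrative partition count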
Stateful transformations
❖ Stateful transformation example
❖ Stateful DStream operators can have infinite lineages
❖ That leads to long failure-recovery times
❖ Spark solves that problem with checkpointing
val actions: DStream[(String, UserAction)] = …
val hotCategories =
  actions.mapWithState(StateSpec.function(stateFunction))
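The stateFunction itself isn't on the slide; a minimal sketch of what it could look like (the UserAction case class and the counting logic are assumptions) — it keeps a running count per key across batches:

import org.apache.spark.streaming.{State, StateSpec}

// Hypothetical event type standing in for the slide's UserAction.
case class UserAction(category: String)

// Called once per (key, value) pair; the State survives across micro-batches.
def stateFunction(key: String,
                  action: Option[UserAction],
                  state: State[Long]): (String, Long) = {
  val count = state.getOption.getOrElse(0L) + 1
  state.update(count)
  (key, count)
}

val spec = StateSpec.function(stateFunction _)

Note that mapWithState requires checkpointing to be enabled, which is exactly the mechanism the bullets above describe.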
Monitoring
❖ Spark Web UI
❖ Metrics:
❖ Console
❖ Ganglia Sink
❖ Graphite Sink (works great with Grafana)
❖ JMX
❖ REST API
Types of sources
❖ Basic sources:
❖ Sockets, HDFS, Akka actors
❖ Advanced sources:
❖ Kafka, Kinesis, Flume, MQTT
❖ Custom sources:
❖ Receiver interface
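For the custom-source case, a rough sketch of the Receiver interface (the endless placeholder-record loop stands in for whatever source you would actually poll):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Start a thread that reads from the source and hands records to Spark via store().
    new Thread("my-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store("record")   // placeholder record; real code would poll an external source
        }
      }
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to clean up in this sketch; the read loop checks isStopped().
  }
}

It would then be wired in with ssc.receiverStream(new MyReceiver).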
Apache Kafka
Greasing the wheels for big data
❖ Incredibly fast message bus
❖ Distributed and fault tolerant
❖ Highly scalable
❖ Strong order guarantees
❖ Easy to replicate across multiple regions
(Diagram: a Producer writing to Kafka brokers — Broker 1, Broker 2 — and a Consumer reading from them.)
Spark 💕 Kafka
❖ Native integration through the direct-stream API
❖ Offset information is stored in the write-ahead log
❖ A restart of the Spark driver reloads the offsets that weren't yet processed
❖ Needs to be explicitly enabled
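A sketch of the direct-stream wiring as it looked in the Spark 1.x / Kafka 0.8 integration (broker list and topic name are placeholders; ssc is the StreamingContext from the earlier sketch):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics      = Set("user-actions")   // placeholder topic

// Direct stream: no receiver; Spark tracks the Kafka offsets itself.
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)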
Storage considerations
❖ HDFS works well for large, batch workloads
❖ HBase works well for random reads and writes
❖ HDFS is well suited for analytical queries
❖ HBase is well suited for interaction with web pages and certain types of range queries
❖ It pays off to persist all data in raw format
Lessons learnt
Architecture
Final thoughts
❖ Start with a reasonably large batch duration, ~10 seconds
❖ Adjust your level of parallelism
❖ Use Kryo for faster serialisation (see the sketch below)
❖ Don’t even start without good monitoring
❖ Find bottlenecks using the Spark UI and monitoring
❖ The issues are usually in the environment surrounding Spark
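The Kryo bullet as a sketch (registering classes is optional but helps; UserAction is the hypothetical type from the stateful-transformations sketch):

import org.apache.spark.SparkConf

// Switch the serializer to Kryo and register the classes that get shuffled around.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[UserAction]))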
?
The End
Bartosz Jankiewicz
@oborygen
bartosz.jankiewicz@gmail.com
References
❖ http://spark.apache.org/docs/latest/streaming-programming-guide.html
❖ https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details
❖ http://milinda.pathirage.org/kappa-architecture.com/
❖ http://lambda-architecture.net
❖ http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
