All Things Open 2015 - Spark & Storm: When & Where?

Mammoth Data
Spark & Storm: When & Where?
www.mammothdata.com | @mammothdataco
The Leader in Big Data Consulting
● BI/Data Strategy
○ Development of a business intelligence / data architecture strategy.
● Installation
○ Installation of Hadoop or relevant technology.
● Data Consolidation
○ Load data from diverse sources into a single scalable repository.
● Streaming
○ Mammoth will write ingestion and/or analytics code that operates on the data as it arrives, and design dashboards, feeds, or computer-driven decision-making processes to derive insights and make decisions.
● Visualization Tools
○ Mammoth will set up a visualization tool (e.g. Tableau, Pentaho). We will also create initial reports and provide training to the employees who will analyze the data.
Mammoth Data, based in downtown Durham (right above Toast)
● Lead Consultant on all things DevOps and Spark
● @carsondial on Twitter
Me!
● Quick overview of Spark Streaming
● Reasons why Spark Streaming can be tricky in practice
● Performance and tuning tips we’ve learnt over the past two years
● …and when to pack it all in and use Storm instead
What This Talk Is About
This IS WEB SCALE!
● I kid, Rails!
● (mostly)
Beyond Web Scale
● Spark & Storm - millions of requests / second on commodity hardware
● Different problems at different scales!
Beyond Web Scale
● Directed Acyclic Graph Data Processing Engine
● Based around the Resilient Distributed Dataset (RDD) primitive
Spark
Spark Streaming — Overview
Spark Streaming — In Production?
● Yes!
● (Alibaba, AutoTrader, Cisco, Netflix, etc.)
● Streaming by running batches very quickly!
● Batch length: can be as low as 0.5s / batch
● Every X seconds, get Y records (DStream/RDDs)
Spark Streaming — Overview
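The micro-batch idea on this slide can be sketched without Spark at all. The block below is only an illustration in plain Scala: it slices a list of event timestamps into fixed windows, which is roughly what "every X seconds, get Y records" means. The object and method names are made up for this sketch and are not Spark APIs.

```scala
// Plain-Scala illustration of micro-batching; nothing here is a Spark API.
object MicroBatchSketch {
  /** Groups event timestamps (in ms) into consecutive windows of batchMs.
    * Returns (window index, timestamps in that window), ordered by window. */
  def toBatches(timestampsMs: Seq[Long], batchMs: Long): Seq[(Long, Seq[Long])] = {
    require(batchMs > 0, "batch length must be positive")
    timestampsMs
      .groupBy(ts => ts / batchMs) // window index = X-second slot the event falls in
      .toSeq
      .sortBy(_._1)
  }
}
```

With a 500 ms batch length, events at 0, 100 and 499 ms land in the first batch, and an event at 500 ms opens the second.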
● Using the same implementation (mostly) for batch and stream processing (Lambda Architecture hipster points ahoy!)
● Access to the rest of Spark - DataFrames, MLlib, GraphX, etc.
Spark Streaming — Good Things
● What happens if you can’t process Y records in X seconds?
● What happens if you require sub-second latency?
Spark Streaming — Bad Things!
Spark Streaming — I’m so sorry.
● What happens if you can’t process Y records in X seconds?
● Data builds up in executors
● Executors run out of memory…
Spark Streaming — Bad Things!
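The arithmetic behind this slide is simple enough to sketch. Assuming a steady arrival rate and a fixed processing capacity per batch (both hypothetical numbers supplied by the caller), the backlog held in the executors grows by the difference on every interval. The names below are illustrative only.

```scala
// Plain-Scala illustration; the rates are hypothetical inputs, not measurements.
object BacklogSketch {
  /** Records still waiting after each of `batches` intervals, given how many
    * arrive per batch and how many can be processed per batch. */
  def backlog(arrivalsPerBatch: Long, capacityPerBatch: Long, batches: Int): Seq[Long] =
    (1 to batches)
      .scanLeft(0L) { (queued, _) =>
        // Whatever cannot be processed this interval carries over to the next.
        math.max(0L, queued + arrivalsPerBatch - capacityPerBatch)
      }
      .tail // drop the initial 0 so there is one entry per batch
}
```

For example, 1,200 records arriving per batch against a capacity of 1,000 leaves 200, 400, 600, ... records queued, a line that only ends when memory does. Once capacity meets or exceeds arrivals, the backlog stays at zero.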
● “Hey, we forgot to tell you Ops people that we have a major new client adding stuff into the firehose sometime today. That’s fine, right?”
Spark Streaming — Bad Things!
Spark Streaming — It Will Be Okay
● As a former Ops person:
● WE WILL REMEMBER.
Spark Streaming — Bad Things!
● Do you need low latency?
● If so, a 10-minute nap is advisable!
● Everybody else, let’s dive in…
Spark Streaming — Tuning
Spark Streaming — Tuning
Spark Streaming — Down In The Hole
Spark Streaming — Down In The Hole
● Easiest method — alter the batch window until it’s all fine!
● Tiny batches provide tight execution times!
Spark Streaming — Down In The Hole
● Use Kafka.
● Data source with the most love (e.g. exactly-once semantics without Write Ahead Logs and receiver-less operation in 1.3+)
● (other sources get the features…eventually)
Spark Streaming — Tuning
● Use Scala.
● CPython = slower in execution
● PyPy is much faster…but…
● New features always come to Scala first.
Spark Streaming — Tuning
● (or Java if you really must)
Spark Streaming — Tuning
● Spark Streaming = data receivers + Spark
● spark.cores.max = x * number of receivers
● For Great Data Locality and Parallelism!
Spark Streaming — Cores
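One way to read this rule of thumb: with receiver-based sources, each receiver occupies a core for as long as the job runs, so the application needs more cores than receivers or nothing is left to process the batches. The sketch below turns the slide's formula into a small helper. `coresPerReceiver` stands in for the slide's x, and the object and method names are illustrative, not part of Spark.

```scala
// Illustrative helper, not a Spark API: derives spark.cores.max from the
// slide's rule of thumb (x * number of receivers).
object CoresSketch {
  def coresMax(numReceivers: Int, coresPerReceiver: Int): Int = {
    require(numReceivers > 0, "need at least one receiver")
    // x must exceed 1, otherwise the receivers take every core.
    require(coresPerReceiver > 1, "x must leave cores free for processing")
    numReceivers * coresPerReceiver
  }

  /** The setting as a key/value pair, the form SparkConf.set(key, value) accepts. */
  def settings(numReceivers: Int, coresPerReceiver: Int): Map[String, String] =
    Map("spark.cores.max" -> coresMax(numReceivers, coresPerReceiver).toString)
}
```

Four receivers with x = 3 gives `spark.cores.max` = 12. The right value of x depends on the workload and is something to measure, not assume.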
● Are you using a foreachRDD loop?
dstream.foreachRDD { rdd =>
  rdd.cache()      // keep this batch in memory while it is reused
  …
  rdd.unpersist()  // release it once the batch has been handled
}
Spark Streaming — Caching
● If routing to multiple stores or iterating over an RDD multiple times, using cache() is a quick win
● It really shouldn’t work so well…
Spark Streaming — Caching
● Hurrah for Spark 1.5!
● spark.streaming.backpressure.enabled = true
● Spark dynamically alters incoming data rates (keeping the data in Kafka rather than in the executors)
● Works for all data sources (for once!)
Spark Streaming — Backpressure
● I really need that low-latency response!
Storm
● Directed Acyclic Graph Data Processing Engine
Storm
Spark
“Very Good, Sir”
Storm
“Here you go!”
● Stream of tuples
● Bolts
● Spouts
● Topologies
Storm Concepts
● Unbounded stream of tuples
● Tuples are defined via schema (usual base types plus custom serializers)
Storm — Streams
● Sources of tuples in a topology
● Read from external sources (e.g. Kafka) and emit the data as tuples
● Can emit multiple streams from a spout!
Storm — Spouts
● Where your processing happens
● Roll your own aggregations / filtering / windowing
● Bolts can feed into other bolts
● Potentially easier to test than Spark Streaming
● Many Bolt connectors for external systems (e.g. Cassandra, Redis, Hive, etc.)
Storm — Bolts
● The DAG of the spouts and bolts
● Built programmatically in code and submitted to the Storm cluster
● Flux - Do It In YAML (and then complain about whitespace)
Storm — Topologies
● Each bolt or spout runs 'tasks' across the cluster
● How parallelism works in Storm
● Set in topology submission
Storm — Tasks
● Where the topology runs
● 1 worker = 1 JVM
● Tasks run in executor threads on a worker
● Storm distributes tasks evenly across cluster
Storm — Workers
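To picture what "evenly across the cluster" means, here is a toy round-robin assignment of tasks to workers in plain Scala. This is only an illustration of an even spread. It is not Storm's scheduler, and the names are invented for the sketch.

```scala
// Toy model of an even spread; Storm's real scheduler is more involved.
object TaskSpreadSketch {
  /** Assigns task ids 0 until numTasks to worker ids 0 until numWorkers,
    * round-robin, so worker loads differ by at most one task. */
  def assign(numTasks: Int, numWorkers: Int): Map[Int, Seq[Int]] = {
    require(numWorkers > 0, "need at least one worker")
    (0 until numTasks).groupBy(taskId => taskId % numWorkers)
  }
}
```

Eight tasks over three workers gives two workers three tasks each and one worker two, which is why raising the task count without adding workers only adds threads per JVM.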
● True Streaming
● Tuples processed as they enter topology - low latency
● Scales far beyond Spark Streaming (currently)
Storm — Good Things
● Battle-tested at Twitter & Yahoo!
● Yahoo! has 300-node clusters and is working to support 1000+ nodes
● Single node clocked at over 1.5m tuples / second at Twitter
Storm — Good Things
● Very DIY (bring your own aggregations, ML, etc)
● Your DAG construction may not be optimal
● Operationally more complex (and Storm WebUI is more primitive)
● Where’s Me REPL?
Storm — Bad Things
Spark or Storm?
● SLA on latency?
Spark or Storm?
● Storm!
● (though simply because it’s possible doesn’t mean you’ll get it!)
Spark or Storm?
● Insane data needs (e.g. ~100m records/second?)
Spark or Storm?
● Storm!
● (though, again, it’s not a magic bullet!)
Spark or Storm?
● For almost anything else? Spark.
● High-level vs. Low-level
● Each new version of Spark delivers improvements!
Spark or Storm?
● Other frameworks that show promise:
○ Flink
○ Apex
○ Samza
○ Heron (Twitter’s not-public Storm replacement)
Other Listing Magazines Are Available
Questions?