WHAT IS THIS ABOUT
a story of how I tried to save money and time and organize the workflow with Spark
spark
environments
workflow
pain & needs
NOT ABOUT
how it works
code
libraries
solutions
AUDIENCE
engineers 70%
analysts and others who work with data
PLAN
where it all started; different and common needs
1. repeatable environment
2. deploying (to prod)
3. debugging & testing
4. business, ad-hoc querying
5. wrangling & exploration
6. etl & streaming applications
0. WHERE IT ALL STARTED
dynamic env, startup
ad-tech, RTB, demand/supply matching
100k/s
real-time decision making
large scale analytics
covid 💀 ⇒ trinityaudio.ai - text-to-speech
audio player
STACK
emr hbase
redis spark
storm memsql
mysql redshift
presto hive
kafka aerospike
elasticsearch gcp
LOTS OF FUN
cron jobs on emr nodes, random jobs on
rundeck
data processes are not centralized
etls in
python
js
php
jenkins & rundeck & go-cd
scala,java,akka,node.js,php,bash
DATA
100k/s - real-time decisions
timeseries events
behavioral data
1. REPEATABLE (DEV?) ENVIRONMENT
SINGLE JOB DEPENDENCIES
kafka
spark
hive
hbase
2xmysql
s3
MOTIVATION
a zoo of scripts, no tests
sandbox - need to run and play
CONSIDERATIONS
local vs cloud? flexibility vs versatility
code vs data
parameterize code
reuse/mock data
input vs output
immutability vs mutability
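The parameterize/mock considerations above can be sketched in Scala (all names here, `JobEnv`, the buckets and paths, are illustrative assumptions, not from the talk): keep environment-specific input/output locations out of the job logic so the same code runs against a local mocked sample or real S3 data.

```scala
// Hypothetical sketch: resolve input/output locations per environment,
// so job logic never hardcodes paths and local runs can reuse mocked data.
final case class JobEnv(name: String, inputRoot: String, outputRoot: String)

object JobEnv {
  // one place that knows all environments; the job itself only sees paths
  def resolve(env: String): JobEnv = env match {
    case "local" => JobEnv("local", "file:///tmp/events-sample", "file:///tmp/out")
    case "stage" => JobEnv("stage", "s3://stage-bucket/events", "s3://stage-bucket/out")
    case "prod"  => JobEnv("prod", "s3://prod-bucket/events", "s3://prod-bucket/out")
    case other   => sys.error(s"unknown environment: $other")
  }
}

// a job then reads its environment once, e.g.:
//   val env = JobEnv.resolve(args.headOption.getOrElse("local"))
//   spark.read.parquet(env.inputRoot) ... .write.parquet(env.outputRoot)
```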
OPTIONS
simply local
docker, vm
bad as a dev env
good for tests
emr stage
parameterize a lot
complexity
EMR - BIG DATA CLUSTER
TIP: EMR MASTER AS PRIMARY DEVELOPMENT ENTRY-POINT
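One way to read this tip (key, user and host below are placeholders for your own EMR setup): the master node already has Spark, Hadoop and the cluster configs installed, so SSH-ing into it gives an interactive shell wired to the real cluster.

```shell
# placeholders: key, user and host depend on your EMR setup
ssh -i ~/.ssh/dev.pem hadoop@emr-master.example.com

# on the master: an interactive spark-shell against the real cluster,
# with the job's jar on the classpath
spark-shell --jars ~/myjob/target/scala-2.12/myjob.jar --num-executors 2
```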
2. DEPLOY
WHAT MATTERS MOST
automation
speed
clarity (reliability)
MOTIVATION
scala
spark
streaming and etl applications
dev, manual testing
OPTIONS
take 1 (default): push to branch ⇨ jenkins ⇨ jar ⇨ spark-submit (10 min)
take 2: sbt build ⇨ scp ⇨ spark-submit (3-4 min)
take 3: rsync source code to emr master
option (hardcore): emacs/vim, develop directly on emr master
continuous rsync/lsyncd
.. ok, this is good enough for me
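A sketch of take 3 and the continuous-sync option (paths, host and class names are placeholders):

```shell
# one-shot: push sources to the master, skipping the CI round-trip
rsync -az --delete --exclude target/ ./myjob/ hadoop@emr-master:/home/hadoop/myjob/

# or keep it continuous: lsyncd re-runs rsync on every local save
lsyncd -rsyncssh ./myjob emr-master /home/hadoop/myjob

# then, on the master:
#   cd ~/myjob && sbt package
#   spark-submit --class com.example.Main target/scala-2.12/myjob_2.12-0.1.jar
```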
3. DEBUG & TEST
manual
automated: unit, integration
BIG DATA & TESTING
bad input may break the whole pipeline
bad input shows up much sooner at scale
the effect of a bug may take weeks or months to be noticed
distributed system effects
huge data
what you can automate and what you cannot
divide & conquer
how to test structured streaming
SparkTest
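One common way to unit-test structured streaming logic (a sketch with illustrative names; it assumes Spark on the test classpath and is not necessarily what the talk's SparkTest does) is to feed a `MemoryStream` and read results back from a memory sink:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream

object StreamingTestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("test").getOrCreate()
    import spark.implicits._

    // the logic under test, kept as a plain function over Datasets
    def aboveThreshold(ds: Dataset[Int]): Dataset[Int] = ds.filter(_ > 10)

    implicit val sqlCtx = spark.sqlContext
    val input = MemoryStream[Int]
    val query = aboveThreshold(input.toDS())
      .writeStream.format("memory").queryName("out").start()

    input.addData(1, 11, 5, 42)
    query.processAllAvailable() // drain the micro-batch synchronously

    val result = spark.sql("select * from out").as[Int].collect().sorted
    assert(result.sameElements(Array(11, 42)))
    query.stop(); spark.stop()
  }
}
```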
MANUAL
separate cluster?
complexity: parameterize - cf, tools
time to start
money, do not forget to shut it down
same prod cluster?
interfering with existing jobs
isolation: yarn queues (complex)
data input: kafka
offload data into some topic
data output - parameterize/mock
TDD DOESN'T WORK WELL HERE
what works better:
experiment ⇨ prototype ⇨ test ⇨ beta prod ⇨ ..
spark, scala, sql
structured streaming for the win!
zeppelin
BLESSING: I NEED AN INTERACTIVE ENVIRONMENT!
4. DATA ACCESS & OPERATIONS
NEEDS
one time maintenance operations
one time data processing
ad hoc querying
analytics vs ops: searching vs operating
all of these need interactive interface
TOOLS
shell
sql client
spark-shell
spark-sql
zeppelin
PROBLEM: HOW TO SHARE THE CODEBASE?
5. AD-HOC QUERYING AND BUSINESS
SOME USE CASES
business doing sql
me wrangling and searching for patterns
me testing at scale
business beta testing
me building streaming app / etl
me performing one-time operations
PRESTO - "DATABASE OF DATABASES"
CONSIDERATIONS
presto vs spark
beauty of spark sql
hive metastore
thrift-server & sql clients
beauty of spark structured streaming
presto vs. spark sql
boring to rewrite sql
lack of custom code
speed
much easier to glue different storages
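The thrift-server point can be sketched as follows (port and host are defaults/placeholders): start Spark's JDBC endpoint so plain SQL clients share the same Spark SQL engine and Hive metastore.

```shell
# start the Spark Thrift Server (a HiveServer2-compatible JDBC endpoint)
$SPARK_HOME/sbin/start-thriftserver.sh --master yarn

# any JDBC SQL client can now connect; e.g. with beeline:
beeline -u jdbc:hive2://emr-master:10000 -e "show tables"
```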
spark: sql vs scala api?
6. WRANGLING
I want to discover!
JUPYTERHUB
EMR NOTEBOOKS
DATABREW
EMR STUDIO
SOME OTHERS
databricks notebooks
IN MY CASE I'M BOUND TO SPARK
prototype in scala + spark
easily move from an experiment/prototype ⇒ productionized streaming/etl application
reuse production code for further experiments/prototypes
interactive wrangling
⇕
production application code
MOTIVATION
business needs quick prototype
requirements may change quickly
keep the workflow optimal
BOTH THESE WORKFLOWS NEED:
versioned code
shared dependencies
shared code (classpath)
unified workflow
1. repl, experiment
2. test at scale
3. acceptance
4. productionize
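The shared code / unified workflow idea above can be sketched like this (`Event`, `Transform` and the tier rule are illustrative assumptions): keep transformations as pure functions in a versioned library jar, so the same code backs both notebook experiments and the productionized app.

```scala
// Hypothetical shared library: the transformation lives here, versioned,
// and is called unchanged from Zeppelin and from the streaming app.
final case class Event(userId: String, amount: Double)
final case class Enriched(userId: String, amount: Double, tier: String)

object Transform {
  // pure and Spark-free, so it is trivially unit-testable
  def enrich(e: Event): Enriched =
    Enriched(e.userId, e.amount, if (e.amount >= 100.0) "high" else "low")
}

// in a Zeppelin paragraph:  events.map(Transform.enrich).show()
// in the prod app:          stream.map(Transform.enrich).writeStream...
```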
OPTIONS?
zeppelin + copy-paste
zeppelin + shared lib
jupyter + sparkmagic + livy + copy-paste
started my own project
SOMETHING WORTH CHECKING
dbt
ACTION
Share your experience
Star the project
😃

"Spark: from interactivity to production and back", Yurii Ostapchuk