WHAT IS THIS ABOUT
a story of how I tried to save money and time and organize the workflow with Spark
spark
environments
workflow
pain & needs
NOT ABOUT
how it works
code
libraries
solutions
AUDIENCE
engineers 70%
analysts and others who work with data
PLAN
where it all started; different and common needs
1. repeatable environment
2. deploying (to prod)
3. debugging & testing
4. business, ad-hoc querying
5. wrangling & exploration
6. etl & streaming applications
0. WHERE IT ALL STARTED
dynamic env, startup
ad-tech, RTB, demand/supply matching
100k/s
real-time decision making
large scale analytics
covid 💀 ⇒ trinityaudio.ai - text-to-speech
audio player
STACK
emr hbase
redis spark
storm memsql
mysql redshift
presto hive
kafka aerospike
elasticsearch gcp
LOTS OF FUN
cron jobs on emr nodes, random jobs on
rundeck
data processes are not centralized
etls in
python
js
php
jenkins & rundeck & go-cd
scala,java,akka,node.js,php,bash
DATA
100k/s - real-time decisions
timeseries events
behavioral data
1. REPEATABLE (DEV?) ENVIRONMENT
SINGLE JOB DEPENDENCIES
kafka
spark
hive
hbase
2xmysql
s3
MOTIVATION
a zoo of scripts, no tests
sandbox - need to run and play
CONSIDERATIONS
local vs cloud? flexibility vs versatility
code vs data
parameterize code
reuse/mock data
input vs output
immutability vs mutability
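The parameterize/mock considerations above can be sketched in Scala (all names here, `JobEnv`, the buckets and paths, are illustrative assumptions, not from the talk): keep environment-specific input/output locations out of the job logic so the same code runs against a local mocked sample or real S3 data.

```scala
// Hypothetical sketch: resolve input/output locations per environment,
// so job logic never hardcodes paths and local runs can reuse mocked data.
final case class JobEnv(name: String, inputRoot: String, outputRoot: String)

object JobEnv {
  // one place that knows all environments; the job itself only sees paths
  def resolve(env: String): JobEnv = env match {
    case "local" => JobEnv("local", "file:///tmp/events-sample", "file:///tmp/out")
    case "stage" => JobEnv("stage", "s3://stage-bucket/events", "s3://stage-bucket/out")
    case "prod"  => JobEnv("prod", "s3://prod-bucket/events", "s3://prod-bucket/out")
    case other   => sys.error(s"unknown environment: $other")
  }
}

// a job then reads its environment once, e.g.:
//   val env = JobEnv.resolve(args.headOption.getOrElse("local"))
//   spark.read.parquet(env.inputRoot) ... .write.parquet(env.outputRoot)
```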
OPTIONS
simply local
docker, vm
bad as a dev env
good for tests
emr stage
parameterize a lot
complexity
EMR - BIG DATA CLUSTER
TIP: EMR MASTER AS PRIMARY DEVELOPMENT ENTRY-POINT
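One way to read this tip (key, user and host below are placeholders for your own EMR setup): the master node already has Spark, Hadoop and the cluster configs installed, so SSH-ing into it gives an interactive shell wired to the real cluster.

```shell
# placeholders: key, user and host depend on your EMR setup
ssh -i ~/.ssh/dev.pem hadoop@emr-master.example.com

# on the master: an interactive spark-shell against the real cluster,
# with the job's jar on the classpath
spark-shell --jars ~/myjob/target/scala-2.12/myjob.jar --num-executors 2
```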
2. DEPLOY
WHAT MATTERS MOST
automation
speed
clarity (reliability)
MOTIVATION
scala
spark
streaming and etl applications
dev, manual testing
OPTIONS
take 1 (default): push to branch ⇨ jenkins ⇨ jar ⇨ spark-submit (10 min)
take 2: sbt build ⇨ scp ⇨ spark-submit (3-4 min)
take 3: rsync source code to emr master
option (hardcore): emacs/vim, develop directly on emr master
continuous rsync/lsyncd
.. ok, this is good enough for me
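A sketch of take 3 and the continuous-sync option (paths, host and class names are placeholders):

```shell
# one-shot: push sources to the master, skipping the CI round-trip
rsync -az --delete --exclude target/ ./myjob/ hadoop@emr-master:/home/hadoop/myjob/

# or keep it continuous: lsyncd re-runs rsync on every local save
lsyncd -rsyncssh ./myjob emr-master /home/hadoop/myjob

# then, on the master:
#   cd ~/myjob && sbt package
#   spark-submit --class com.example.Main target/scala-2.12/myjob_2.12-0.1.jar
```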
3. DEBUG & TEST
manual
automated: unit, integration
BIG DATA & TESTING
bad input may break the whole pipeline
bad input shows up much sooner at scale
the effect of a bug may take weeks or months to be noticed
distributed system effects
huge data
what you can automate and what you cannot
divide & conquer
how to test structured streaming
SparkTest
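One common way to unit-test structured streaming logic (a sketch with illustrative names; it assumes Spark on the test classpath and is not necessarily what the talk's SparkTest does) is to feed a `MemoryStream` and read results back from a memory sink:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream

object StreamingTestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("test").getOrCreate()
    import spark.implicits._

    // the logic under test, kept as a plain function over Datasets
    def aboveThreshold(ds: Dataset[Int]): Dataset[Int] = ds.filter(_ > 10)

    implicit val sqlCtx = spark.sqlContext
    val input = MemoryStream[Int]
    val query = aboveThreshold(input.toDS())
      .writeStream.format("memory").queryName("out").start()

    input.addData(1, 11, 5, 42)
    query.processAllAvailable() // drain the micro-batch synchronously

    val result = spark.sql("select * from out").as[Int].collect().sorted
    assert(result.sameElements(Array(11, 42)))
    query.stop(); spark.stop()
  }
}
```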
MANUAL
separate cluster?
complexity: parameterize - cf, tools
time to start
money, do not forget to shut it down
same prod cluster?
interfering with existing jobs
isolation: yarn queues (complex)
data input: kafka
offload data into some topic
data output - parameterize/mock
TDD DOESN'T WORK WELL HERE
what works better:
experiment ⇨ prototype ⇨ test ⇨ beta prod ⇨ ..
spark, scala, sql
structured streaming for the win!
zeppelin
BLESSING: I NEED AN INTERACTIVE ENVIRONMENT!
4. DATA ACCESS & OPERATIONS
NEEDS
one time maintenance operations
one time data processing
ad hoc querying
analytics vs ops: searching vs operating
all of these need interactive interface
TOOLS
shell
sql client
spark-shell
spark-sql
zeppelin
PROBLEM: HOW TO SHARE THE CODEBASE?
5. AD-HOC QUERYING AND BUSINESS
SOME USE CASES
business doing sql
me wrangling and searching for patterns
me testing at scale
business beta testing
me building streaming app / etl
me performing one-time operations
PRESTO - "DATABASE OF DATABASES"
CONSIDERATIONS
presto vs spark
beauty of spark sql
hive metastore
thrift-server & sql clients
beauty of spark structured streaming
presto vs. spark sql
boring to rewrite sql
lack of custom code
speed
much easier to glue different storages
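The thrift-server point can be sketched as follows (port and host are defaults/placeholders): start Spark's JDBC endpoint so plain SQL clients share the same Spark SQL engine and Hive metastore.

```shell
# start the Spark Thrift Server (a HiveServer2-compatible JDBC endpoint)
$SPARK_HOME/sbin/start-thriftserver.sh --master yarn

# any JDBC SQL client can now connect; e.g. with beeline:
beeline -u jdbc:hive2://emr-master:10000 -e "show tables"
```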
spark: sql vs scala api?
6. WRANGLING
I want to discover!
JUPYTERHUB
EMR NOTEBOOKS
DATABREW
EMR STUDIO
SOME OTHERS
databricks notebooks
IN MY CASE I'M BOUND TO SPARK
prototype in scala + spark
easily move from an experiment/prototype ⇒ productionized streaming/etl application
reuse production code for further experiments/prototypes
interactive wrangling
⇕
production application code
MOTIVATION
business needs quick prototype
requirements may change quickly
keep the workflow optimal
BOTH THESE WORKFLOWS NEED:
versioned code
shared dependencies
shared code (classpath)
unified workflow
1. repl, experiment
2. test at scale
3. acceptance
4. productionize
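The shared code / unified workflow idea above can be sketched like this (`Event`, `Transform` and the tier rule are illustrative assumptions): keep transformations as pure functions in a versioned library jar, so the same code backs both notebook experiments and the productionized app.

```scala
// Hypothetical shared library: the transformation lives here, versioned,
// and is called unchanged from Zeppelin and from the streaming app.
final case class Event(userId: String, amount: Double)
final case class Enriched(userId: String, amount: Double, tier: String)

object Transform {
  // pure and Spark-free, so it is trivially unit-testable
  def enrich(e: Event): Enriched =
    Enriched(e.userId, e.amount, if (e.amount >= 100.0) "high" else "low")
}

// in a Zeppelin paragraph:  events.map(Transform.enrich).show()
// in the prod app:          stream.map(Transform.enrich).writeStream...
```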
OPTIONS?
zeppelin + copy-paste
zeppelin + shared lib
jupyter + sparkmagic + livy + copy-paste
started my own project
SOMETHING WORTH CHECKING
dbt
ACTION
Share your experience
Star the project
😃

"Spark: from interactivity to production and back", Yurii Ostapchuk