Going from experiment to deployed prototype as fast as possible is invaluable in a dynamic startup environment. Being able to respond quickly to change is no less important.
From interactive ad-hoc analysis to production applications with Spark and back - this is the story of one spirited engineer trying to make his life a little easier and a little more efficient while wrangling data, writing Scala code and deploying Spark applications. The problems faced, the lessons learned, the options considered and some smart solutions and ideas - this is what we will go through.
5. PLAN
where it all started, different and common needs
1. repeatable environment
2. deploying (to prod)
3. debugging & testing
4. business, ad-hoc querying
5. wrangling & exploration
6. etl & streaming applications
11. LOTS OF FUN
cron jobs on emr nodes, random jobs on
rundeck
data processes are not centralized
etls in python, js, php
jenkins & rundeck & go-cd
scala,java,akka,node.js,php,bash
24. OPTIONS
take1(default): push to branch ⇨ jenkins ⇨ jar ⇨ spark-submit (10min)
take2: sbt build ⇨ scp ⇨ spark-submit (3-4min)
take3: rsync source code to emr master
option(hardcore): emacs/vim develop directly on emr master
continuous rsync/lsyncd
.. ok, this is good enough for me
26. BIG DATA & TESTING
bad input may break the whole pipeline
bad input will happen much faster
effect of a bug may take weeks or months until noticed
distributed system effects
huge data
what you can automate and what you cannot
divide & conquer
how to test structured streaming
SparkTest
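For the structured-streaming question above, one workable pattern is to feed a `MemoryStream` into the transformation under test and read results back from the in-memory sink. A minimal sketch, assuming `spark-sql` is on the test classpath; the object name, table name and transformation are illustrative:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream

object StreamingTransformSpec {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("streaming-test").getOrCreate()
    import spark.implicits._
    implicit val sqlCtx = spark.sqlContext

    // the transformation under test: a pure Dataset => Dataset function,
    // so the very same code can be wired into the production query
    def transform(events: Dataset[String]): Dataset[String] =
      events.filter(_.nonEmpty).map(_.toUpperCase)

    val input = MemoryStream[String]
    val query = transform(input.toDS())
      .writeStream
      .format("memory")      // results land in an in-memory table
      .queryName("out")
      .outputMode("append")
      .start()

    input.addData("a", "", "b")
    query.processAllAvailable() // drain everything that was added

    val result = spark.table("out").as[String].collect().sorted
    assert(result.sameElements(Array("A", "B")))

    query.stop()
    spark.stop()
  }
}
```

Keeping the transformation as a plain `Dataset => Dataset` function is what makes this testable at all - the streaming plumbing stays at the edges.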
27. MANUAL
separate cluster?
complexity: parameterize - cf, tools
time to start
money, do not forget to shut it down
same prod cluster?
interfering with existing jobs
isolation: yarn queues (complex)
data input: kafka
offload data into some topic
data output - parameterize/mock
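"Parameterize/mock the data output" can be as simple as putting the sink behind a small trait, so the same job writes to production storage in prod and into memory in a manual test run. A sketch with hypothetical names (`Sink`, `ParquetSink`, `CollectingSink`, `runJob` are all illustrative):

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.col

// the job only knows about this interface
trait Sink { def write(df: DataFrame): Unit }

// production: write parquet to the configured path
final class ParquetSink(path: String) extends Sink {
  def write(df: DataFrame): Unit =
    df.write.mode("overwrite").parquet(path)
}

// manual testing: capture rows instead of touching any storage
final class CollectingSink extends Sink {
  var rows: Seq[Row] = Nil
  def write(df: DataFrame): Unit = rows = df.collect().toSeq
}

// the job body stays identical in both environments
def runJob(input: DataFrame, sink: Sink): Unit =
  sink.write(input.filter(col("valid")))
```

The same trick works on the input side: a `Source` trait reading either the real Kafka topic or a fixture `DataFrame`.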
31. NEEDS
one time maintenance operations
one time data processing
ad hoc querying
analytics vs ops: searching vs operating
all of these need interactive interface
35. SOME USE CASES
business doing sql
me wrangling and searching for patterns
me testing at scale
business beta testing
me building streaming app / etl
me performing one-time operations
37. CONSIDERATIONS
presto vs spark
beauty of spark sql
hive metastore
thrift-server & sql clients
beauty of spark structured streaming
presto vs. spark sql
boring to rewrite sql
lack of custom code
speed
much easier to glue different storages
spark: sql vs scala api?
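The "sql vs scala api" question is softer than it looks, since both compile to the same logical plan. A small sketch of the same aggregation both ways; the `events` dataset and `user_id` column are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// hypothetical input
val events = Seq(("u1", 10), ("u1", 20), ("u2", 5)).toDF("user_id", "amount")

// SQL flavour - handy for business users and ad-hoc querying
events.createOrReplaceTempView("events")
val bySql = spark.sql("SELECT user_id, count(*) AS n FROM events GROUP BY user_id")

// Scala API flavour - composable, refactorable, checked at compile time
val byApi = events.groupBy("user_id").agg(count("*").as("n"))
```

SQL wins for ad-hoc querying and for people who already speak it; the Scala API wins once the query needs custom code or has to be reused across jobs.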
44. IN MY CASE I'M BOUND TO SPARK
prototype in scala + spark
easily move from experiment/prototype ⇒ productionized streaming/etl application
reuse production code for further experiments/prototypes
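The experiment ⇒ production ⇒ experiment loop above works because `Dataset.transform` lets one transformation serve batch ETL, a streaming application, and an interactive shell session unchanged. A sketch under assumed names - the bucket paths, topic name and `enrich` logic are illustrative:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// the shared business logic, prototyped interactively first
def enrich(df: DataFrame): DataFrame =
  df.withColumn("day", to_date(col("ts")))

// batch ETL
spark.read.parquet("s3://bucket/events")
  .transform(enrich)
  .write.parquet("s3://bucket/enriched")

// streaming application - same code path
spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS ts") // illustrative parsing
  .transform(enrich)
  .writeStream.format("parquet")
  .option("path", "s3://bucket/enriched")
  .option("checkpointLocation", "s3://bucket/chk")
  .start()
```

In a spark-shell or notebook the same `enrich` can be pasted in and applied to any ad-hoc `DataFrame`, which closes the loop back to experimentation.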