The modern data customer wants data now. Batch workloads are not going anywhere, but at Scribd the future of our data platform requires more and more streaming data sets.
1. the revolution will be streamed: building a real-time data platform with spark and delta lake
a very professional presentation by r tyler croy - scribd
2. new phone who dis
● director of platform engineering at scribd
○ leads data engineering and core platform teams
● historically involved mostly in backend data services
● long-time open source contributor
● doesn't like waiting for data
3. scribd
● sign up at scribd.com
● one of the world's largest digital libraries
○ books
○ audiobooks
○ sheet music
○ user-uploaded documents
● many data. wow.
● our mission is to change the way the world reads
4. platform engineering
● helping to scale applications and data at scribd
● mostly focusing on our data platform
● enabling more internal teams to make better products
🙌 with data 🙌
🤗 alex, christian, dima, hamilton, kostas, max, qp, pavlo, rajiv, stas, trinity 🤗
7. such problems
● all data available in nightly batches
○ batch sizes grow with company growth
○ users wait 24+ hours for fresh data, A/B test results, etc.
● data customers with stockholm syndrome
● on-premise
○ we are very not good at running it
○ hadoop + ruby + hive + spark + 🔥
○ elastic like a rock
○ 🔥 hdfs 🔥
12. "not ideal"
● tons of discussions/investigations into s3 consistency
○ s3guard i guess
● protocol buffers were an idea
● aws emr was not pleasant
● streams of data on one side, batches on the other
○ never the twain shall meet
17. OPTIMIZE
● optimize creates a new set of files
● turning a bunch of little files into a few big files
● can be done "online"
● (after an optimize, a vacuum is needed to clean up the old files; sketch below)
🙌 small files problem disappears 🙌
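to make the compaction step concrete, here's a minimal sketch, assuming a runtime where Delta's OPTIMIZE and VACUUM SQL commands are available (Databricks, or a sufficiently recent Delta Lake); the `events` table name is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compaction").getOrCreate()

# rewrite many small files into a few large ones; this only commits a new
# table version, so readers keep working while it runs ("online")
spark.sql("OPTIMIZE events")

# the pre-optimize small files stay on storage until vacuumed; 168 hours
# (7 days) is Delta's default retention, spelled out here for clarity
spark.sql("VACUUM events RETAIN 168 HOURS")
```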
18. performance
● every file is a compressed parquet file
● every data mutation is a new file write, never an in-place update (sketch below)
● fast
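as an illustration of the no-updates point, a sketch of a row mutation through the Delta APIs; the session config lines are the standard OSS Delta wiring (preconfigured on Databricks), and the table path and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# standard OSS Delta session wiring; on Databricks this comes preconfigured
spark = (SparkSession.builder
         .appName("mutation-demo")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

table = DeltaTable.forPath(spark, "s3://bucket/tables/documents")  # hypothetical

# an "update" rewrites the affected parquet files and commits a new table
# version; no file is ever modified in place
table.update(
    condition="id = 42",
    set={"status": "'archived'"}  # value is a SQL expression, hence the quotes
)
```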
19. sinks and streaming sources
● a delta table can be a sink for streaming data
● a delta table can be the source for another streaming consumer (chained in the sketch below)
● backfills are awesome
○ everything downstream gets the updated records
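a minimal sketch of that sink-and-source chaining, assuming a Delta-enabled session; the rate source stands in for a real upstream, and all paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-chain").getOrCreate()

# stage 1: a stream lands in a Delta table (the table as a sink)
raw = spark.readStream.format("rate").load()  # stand-in upstream source
(raw.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://bucket/checkpoints/raw")
    .start("s3://bucket/tables/raw"))

# stage 2: the same table feeds the next consumer (the table as a source);
# a backfill appended to `raw` flows through to everything downstream
(spark.readStream
    .format("delta")
    .load("s3://bucket/tables/raw")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "s3://bucket/checkpoints/enriched")
    .start("s3://bucket/tables/enriched"))

spark.streams.awaitAnyTermination()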
22. not all smiles
● caveats when using a table as a streaming source (one sketched below)
● requires a spark context in order to query
○ therefore requires a cluster
○ tableau integration not native, must use an always-on Spark cluster as an ODBC proxy
● delta.io != delta in the databricks runtime™
● concurrent writes are … they're something
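one concrete streaming-source caveat, sketched with Delta's documented source options: by default the source accepts appends only, so a stream reading a table breaks when upstream rows are rewritten, unless it's told to tolerate (and possibly re-receive) those changes. the path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("source-caveats").getOrCreate()

# an upstream UPDATE, MERGE, or DELETE fails the stream unless ignored
stream = (spark.readStream
          .format("delta")
          .option("ignoreDeletes", "true")   # tolerate partition-level deletes
          .option("ignoreChanges", "true")   # tolerate rewrites; rewritten rows may be re-emitted downstream
          .load("s3://bucket/tables/raw"))
```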
25. pretty good
● notebook collaboration made streaming job development easier
● developer self-service did too
● high utilization clusters with "delta cache" 👍
● really effective cluster auto-scaling*
● integrated well with our use of aws glue catalog
26. challenges
● not in our preferred region (yet)
● deployed on 7.0 beta runtime out of necessity
● jobs are really easy to terminate
● auto-scaling "doesn't work" for streaming jobs
● api tokens / account management not really there
● monitoring of production streaming applications is subpar
○ metrics
○ logs
○ alerting
33. more streams
● 50+ topics in kafka to persist into delta (job shape sketched below)
○ many are underutilized data streams
● convert batch data sources to streams
○ database change data capture
○ data aggregations external to data platform
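the per-topic persistence job is roughly this shape, assuming the spark-sql-kafka connector is on the classpath; broker address, topic name, and paths are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "page_views")
          .load()
          # kafka hands over key/value as binary; decode before persisting
          .select(col("key").cast("string"),
                  col("value").cast("string"),
                  col("timestamp")))

(events.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://bucket/checkpoints/page_views")
    .start("s3://bucket/tables/page_views")
    .awaitTermination())
```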
34. tuning
● optimizing on streaming tables can be costly
○ figuring out the right cadence, cluster size, etc
● should a streaming job handle one stream or many? (one shape sketched below)
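the "many streams per job" shape looks roughly like this sketch: one application starts a query per topic, each with its own checkpoint, and the job lives or dies as a unit. topic names and paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-stream").getOrCreate()

for topic in ["page_views", "bookmarks", "searches"]:  # hypothetical topics
    (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", topic)
        .load()
        .writeStream
        .format("delta")
        .option("checkpointLocation", f"s3://bucket/checkpoints/{topic}")
        .start(f"s3://bucket/tables/{topic}"))

# block until any one stream dies, so the whole job restarts together;
# the trade-off is sizing one cluster for unevenly loaded topics
spark.streams.awaitAnyTermination()
```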
35. productionize a thing
● operational maturity for spark streams one way or another
● continuous delivery
● end-to-end stream performance monitoring (a starting point sketched below)
● auto-scale the hard way (?)
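for the monitoring piece, one portable starting point is polling spark's built-in progress reports; this sketch just prints them, and shipping the numbers into real metrics and alerting is the remaining work. the rate source stands in for a production query:

```python
import json
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-monitor").getOrCreate()

# stand-in query; in production this would be the kafka-to-delta stream
query = (spark.readStream.format("rate").load()
         .writeStream.format("console").start())

while query.isActive:
    progress = query.lastProgress  # stats for the most recent micro-batch
    if progress:
        print(json.dumps({
            "batchId": progress["batchId"],
            "inputRowsPerSecond": progress.get("inputRowsPerSecond"),
            "processedRowsPerSecond": progress.get("processedRowsPerSecond"),
        }))
    time.sleep(60)
```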