The modern data customer wants data now. Batch workloads are not going anywhere, but at Scribd the future of our data platform requires more and more streaming data sets.
1. the revolution will be streamed: building a real-time data platform with spark and delta lake
a very professional presentation by r tyler croy - scribd
2. new phone who dis
● director of platform engineering at scribd
○ leads data engineering and core platform teams
● historically involved mostly in backend data services
● long-time open source contributor
● doesn't like waiting for data
3. scribd
● sign up at scribd.com
● one of the world's largest digital libraries
○ books
○ audiobooks
○ sheet music
○ user-uploaded documents
● many data. wow.
● our mission is to change the way the world reads
4. platform engineering
● helping to scale applications and data at scribd
● mostly focusing on our data platform
● enabling more internal teams to make better products
🙌 with data 🙌
🤗 alex, christian, dima, hamilton, kostas, max, qp, pavlo, rajiv, stas, trinity 🤗
7. such problems
● all data available in nightly batches
○ batch sizes grow with company growth
○ users wait 24+ hours for fresh data, A/B test results, etc.
● data customers with stockholm syndrome
● on-premise
○ we are very not good at running it
○ hadoop + ruby + hive + spark + 🔥
○ elastic like a rock
○ 🔥 hdfs 🔥
12. "not ideal"
● tons of discussions/investigations into s3 consistency
○ s3guard i guess
● protocol buffers were an idea
● aws emr was not pleasant
● streams of data on one side, batches on the other
○ never the twain shall meet
17. OPTIMIZE
● optimize creates a new set of files
● turning a bunch of little files into a few big files
● can be done "online"
● (after an optimize, a vacuum is needed to clean up the old files; sketch below)
🙌 small files problem disappears 🙌
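to make the compaction step concrete, here's a minimal sketch, assuming a runtime where Delta's OPTIMIZE and VACUUM SQL commands are available (Databricks, or a sufficiently recent Delta Lake); the `events` table name is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compaction").getOrCreate()

# rewrite many small files into a few large ones; this only commits a new
# table version, so readers keep working while it runs ("online")
spark.sql("OPTIMIZE events")

# the pre-optimize small files stay on storage until vacuumed; 168 hours
# (7 days) is Delta's default retention, spelled out here for clarity
spark.sql("VACUUM events RETAIN 168 HOURS")
```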
18. performance
● every file is a compressed parquet file
● every data mutation is a new file write, never an in-place update (sketch below)
● fast
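as an illustration of the no-updates point, a sketch of a row mutation through the Delta APIs; the session config lines are the standard OSS Delta wiring (preconfigured on Databricks), and the table path and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# standard OSS Delta session wiring; on Databricks this comes preconfigured
spark = (SparkSession.builder
         .appName("mutation-demo")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

table = DeltaTable.forPath(spark, "s3://bucket/tables/documents")  # hypothetical

# an "update" rewrites the affected parquet files and commits a new table
# version; no file is ever modified in place
table.update(
    condition="id = 42",
    set={"status": "'archived'"}  # value is a SQL expression, hence the quotes
)
```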
19. sinks and streaming sources
● a delta table can be a sink for streaming data
● a delta table can be the source for another streaming consumer (chained in the sketch below)
● backfills are awesome
○ everything downstream gets the updated records
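a minimal sketch of that sink-and-source chaining, assuming a Delta-enabled session; the rate source stands in for a real upstream, and all paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-chain").getOrCreate()

# stage 1: a stream lands in a Delta table (the table as a sink)
raw = spark.readStream.format("rate").load()  # stand-in upstream source
(raw.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://bucket/checkpoints/raw")
    .start("s3://bucket/tables/raw"))

# stage 2: the same table feeds the next consumer (the table as a source);
# a backfill appended to `raw` flows through to everything downstream
(spark.readStream
    .format("delta")
    .load("s3://bucket/tables/raw")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "s3://bucket/checkpoints/enriched")
    .start("s3://bucket/tables/enriched"))

spark.streams.awaitAnyTermination()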
22. not all smiles
● caveats when using a table as a streaming source (one sketched below)
● requires a spark context in order to query
○ therefore requires a cluster
○ tableau integration not native, must use an always-on Spark cluster as an ODBC proxy
● delta.io != delta in the databricks runtime™
● concurrent writes are … they're something
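one concrete streaming-source caveat, sketched with Delta's documented source options: by default the source accepts appends only, so a stream reading a table breaks when upstream rows are rewritten, unless it's told to tolerate (and possibly re-receive) those changes. the path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("source-caveats").getOrCreate()

# an upstream UPDATE, MERGE, or DELETE fails the stream unless ignored
stream = (spark.readStream
          .format("delta")
          .option("ignoreDeletes", "true")   # tolerate partition-level deletes
          .option("ignoreChanges", "true")   # tolerate rewrites; rewritten rows may be re-emitted downstream
          .load("s3://bucket/tables/raw"))
```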
25. pretty good
● notebook collaboration made streaming job development easier
● developer self-service did too
● high utilization clusters with "delta cache" 👍
● really effective cluster auto-scaling*
● integrated well with our use of aws glue catalog
26. challenges
● not in our preferred region (yet)
● deployed on 7.0 beta runtime out of necessity
● jobs are really easy to terminate
● auto-scaling "doesn't work" for streaming jobs
● api tokens / account management not really there
● monitoring of production streaming applications is subpar
○ metrics
○ logs
○ alerting
33. more streams
● 50+ topics in kafka to persist into delta (job shape sketched below)
○ many are underutilized data streams
● convert batch data sources to streams
○ database change data capture
○ data aggregations external to data platform
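the per-topic persistence job is roughly this shape, assuming the spark-sql-kafka connector is on the classpath; broker address, topic name, and paths are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "page_views")
          .load()
          # kafka hands over key/value as binary; decode before persisting
          .select(col("key").cast("string"),
                  col("value").cast("string"),
                  col("timestamp")))

(events.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://bucket/checkpoints/page_views")
    .start("s3://bucket/tables/page_views")
    .awaitTermination())
```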
34. tuning
● optimizing on streaming tables can be costly
○ figuring out the right cadence, cluster size, etc
● should a streaming job handle one stream or many? (one shape sketched below)
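the "many streams per job" shape looks roughly like this sketch: one application starts a query per topic, each with its own checkpoint, and the job lives or dies as a unit. topic names and paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-stream").getOrCreate()

for topic in ["page_views", "bookmarks", "searches"]:  # hypothetical topics
    (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", topic)
        .load()
        .writeStream
        .format("delta")
        .option("checkpointLocation", f"s3://bucket/checkpoints/{topic}")
        .start(f"s3://bucket/tables/{topic}"))

# block until any one stream dies, so the whole job restarts together;
# the trade-off is sizing one cluster for unevenly loaded topics
spark.streams.awaitAnyTermination()
```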
35. productionize a thing
● operational maturity for spark streams one way or another
● continuous delivery
● end-to-end stream performance monitoring (a starting point sketched below)
● auto-scale the hard way (?)
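for the monitoring piece, one portable starting point is polling spark's built-in progress reports; this sketch just prints them, and shipping the numbers into real metrics and alerting is the remaining work. the rate source stands in for a production query:

```python
import json
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-monitor").getOrCreate()

# stand-in query; in production this would be the kafka-to-delta stream
query = (spark.readStream.format("rate").load()
         .writeStream.format("console").start())

while query.isActive:
    progress = query.lastProgress  # stats for the most recent micro-batch
    if progress:
        print(json.dumps({
            "batchId": progress["batchId"],
            "inputRowsPerSecond": progress.get("inputRowsPerSecond"),
            "processedRowsPerSecond": progress.get("processedRowsPerSecond"),
        }))
    time.sleep(60)
```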