The document discusses using Apache Spark to build a batch processing pipeline for aggregating music streaming user data. It presents the problem of aggregating a user's top liked genres for a music streaming app, then covers building a basic Spark pipeline to ingest new data, transform it into aggregations, and update existing aggregations. It goes on to discuss using Delta Lake and incremental state updates to optimize the pipeline, and closes with approaches for continuously refining the aggregations.
5. MEET RISKIFIED
Riskified is an AI platform
powering the eCommerce
revolution. We have an
unparalleled ability to recognize
legitimate customers and keep
them moving toward conversion.
The world’s largest brands trust us
to increase revenue, manage risk,
and improve customer
interactions.
1M+ transactions daily
600+ total employees, 130 in R&D
$229M in funding to date
Enterprise focus: clients include many publicly traded companies
7. • BlueNote - an imagined music streaming app
• We’re asked to present a per-user aggregation
in the app’s UserInfo page
• There are many metrics required in this project.
We will focus on “top liked genres”
THE PROBLEM SPACE
16. We gain:
• Faster response times
• Scalability
• Cleaner separation of concerns
However, we pay with:
• Data latency
• Diminished agility
• Storage & compute costs
IN PRAISE OF AGGREGATIONS
Aggregation trade-offs
18. SPARK 101
Apache Spark is an
open-source unified analytics
engine for large-scale data
processing. Spark provides an
interface for programming
entire clusters with implicit data
parallelism and fault tolerance.
— Wikipedia
19. • Spark itself is written in Scala
• Spark Applications can be written in Scala,
Java, Python or R
• Dataset vs. DataFrame
• Various data formats, with Parquet as the default
◦ Columnar format
◦ Compression
◦ Nested data structures
SPARK 101
21. • Bring in new raw data to work with
• Transform the new raw data into a useful aggregation
• Allow the updated aggregation to be accessible for apps
A BASIC PIPELINE
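The transform stage's logic can be sketched independently of any specific stack. A minimal plain-Python sketch of turning raw play events into a per-user "top liked genres" aggregation (the record shape, field names like `user_id` and `liked`, and the top-3 cutoff are all illustrative assumptions, not from the deck):

```python
from collections import Counter

def transform(plays):
    """Turn raw play events into a per-user top-liked-genres aggregation.

    `plays` is a list of {"user_id": ..., "genre": ..., "liked": bool}
    records -- a hypothetical stand-in for newly ingested raw data.
    """
    per_user = {}
    for play in plays:
        if play["liked"]:
            per_user.setdefault(play["user_id"], Counter())[play["genre"]] += 1
    # Keep only the top liked genres per user, ready to be served to the app.
    return {
        user: [genre for genre, _ in counts.most_common(3)]
        for user, counts in per_user.items()
    }

plays = [
    {"user_id": "u1", "genre": "jazz", "liked": True},
    {"user_id": "u1", "genre": "jazz", "liked": True},
    {"user_id": "u1", "genre": "rock", "liked": True},
    {"user_id": "u1", "genre": "pop", "liked": False},
    {"user_id": "u2", "genre": "pop", "liked": True},
]
print(transform(plays))  # {'u1': ['jazz', 'rock'], 'u2': ['pop']}
```

In the real pipeline this computation would run as a Spark job over the ingested batch, but the aggregation itself is just this kind of group-and-count.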
43. • ACID transactions over HDFS
• Updating Parquet files is slow
• Delta allows us to postpone optimization
• Integrates perfectly with Spark
• We are already using S3
How can Delta Lake help?
INCREMENTAL STATE
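A sketch of what this looks like with Delta Lake's MERGE, which updates existing aggregation rows in place rather than rewriting whole Parquet files. This assumes the `delta-spark` package, an active `SparkSession` named `spark`, and hypothetical S3 paths and column names:

```python
# Hedged sketch: requires a Spark cluster with delta-spark configured.
from delta.tables import DeltaTable

# Hypothetical path: the batch of freshly computed per-user aggregations.
new_aggs = spark.read.parquet("s3://bluenote/new-aggregations/")

# Hypothetical path: the Delta table holding the current aggregation state.
target = DeltaTable.forPath(spark, "s3://bluenote/user-genre-aggregations/")

(target.alias("current")
    .merge(new_aggs.alias("incoming"), "current.user_id = incoming.user_id")
    .whenMatchedUpdateAll()     # overwrite state for users we already know
    .whenNotMatchedInsertAll()  # insert rows for first-time users
    .execute())
```

Because Delta wraps this in an ACID transaction, readers never see a half-updated aggregation, which is what lets the team postpone deeper optimization work.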
47. Semigroup is a neat, simple typeclass that allows us to describe how
two instances of type A combine into a single A:
(A, A) => A
The Semigroup instance for
Map comes from here!
Semigroup for the win
INCREMENTAL STATE
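The combine operation `(A, A) => A` is exactly what lets two partial aggregation states merge into one. A minimal plain-Python sketch of the idea, combining two genre-count maps value-wise the way a `Semigroup` instance for `Map` would (the genre names and the `combine` helper are illustrative, not from the deck):

```python
def combine(a, b):
    """Semigroup-style combine for dicts of counts: (A, A) => A.

    Values under the same key are themselves combined (here: added),
    mirroring how a Semigroup instance for Map merges value-wise.
    """
    merged = dict(a)
    for key, value in b.items():
        merged[key] = merged.get(key, 0) + value
    return merged

state = {"jazz": 4, "rock": 1}  # existing aggregation state
fresh = {"rock": 2, "pop": 1}   # newly computed increment
print(combine(state, fresh))  # {'jazz': 4, 'rock': 3, 'pop': 1}
```

Because `combine` is associative, increments can be folded into the state in any grouping, which is what makes incremental updates safe to batch and reorder.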
59. • Proper data: what feeds our process?
• The population stage
• Testability
• Stack agnosticism
• S3 could be GCS or other
• Final DB can be anything
• Bring your own scheduler
AFTERTHOUGHTS