Data preparation is always a challenge. Why care about infrastructure?
Come learn how to deploy your Spark jobs in minutes using our managed services, EMR & Glue, and focus on your business needs.
23. Founded in 2012 by Ran Sarig, Efi Cohen & Katrin Ribant
• +3000 Brands
• +300 Agencies
• +50 Publishers
• +25 Industry Verticals
• 16 Offices worldwide
• $50M Funding from Lightspeed, Innovation Endeavors
• 300+ Employees & growing quickly
32. Transformations at scale
• Extract and Transform data
• Calculated columns
• Vlookups/Fuzzy match
• Complex logic and iterations
• Sandboxed environment
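As a rough illustration of two of the transformation types listed above, here is a minimal plain-Python sketch of a calculated column and a vlookup with fuzzy matching. It is not the Spark implementation from the talk: the lookup table, the column names (`campaign`, `clicks`, `impressions`, `ctr`, `channel`) and the helper names are all hypothetical, and the fuzzy match uses the standard library's `difflib` as a stand-in for whatever matcher is actually used.

```python
from difflib import get_close_matches

# Hypothetical lookup table: canonical campaign name -> channel.
CHANNEL_BY_CAMPAIGN = {
    "Spring Sale Search": "search",
    "Spring Sale Social": "social",
}


def fuzzy_vlookup(key, table, cutoff=0.8):
    """Return the value of the closest key in `table`, or None if no key is close enough."""
    normalized = {k.lower(): k for k in table}
    matches = get_close_matches(key.lower().strip(), list(normalized), n=1, cutoff=cutoff)
    return table[normalized[matches[0]]] if matches else None


def transform(row):
    """Add a calculated column (ctr) and a looked-up column (channel) to one row."""
    out = dict(row)
    impressions = row["impressions"]
    # Calculated column: guard against division by zero.
    out["ctr"] = row["clicks"] / impressions if impressions else 0.0
    # Vlookup with fuzzy match: tolerates small typos in the campaign name.
    out["channel"] = fuzzy_vlookup(row["campaign"], CHANNEL_BY_CAMPAIGN)
    return out


rows = [
    {"campaign": "spring sale serch", "clicks": 5, "impressions": 100},
    {"campaign": "Unrelated", "clicks": 0, "impressions": 0},
]
transformed = [transform(r) for r in rows]
```

The misspelled "spring sale serch" resolves to the search channel, while a name with no close match gets no channel at all; the `cutoff` value controls how tolerant the match is and would need tuning on real data.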
33. Marketing data is NOT immutable
• External vendors have reconciliation windows (up to 6 months)
• Our users want to update/delete specific rows or sets of rows
• Our users love to backdate
• Most (if not all) big data solutions are append-only, and updating the data is considered a heavy process
38. Atomic Update Flow
1) Read and increment table upload id
2) Read input file
3) Read "to be updated" partitions from S3
4) Merge the two dataframes
5) Write out partitions to new locations, e.g. /20180314_27
6) Update Hive: ALTER TABLE table_name PARTITION (date='20180314') SET LOCATION "/20180314_27";
7) Reclaim stale data offline, periodically
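The atomic update flow can be sketched in plain Python to make the ordering of the steps concrete. This is a simulation under stated assumptions, not the production Spark job: the dictionaries `storage`, `metastore` and `upload_counter` are hypothetical in-memory stand-ins for S3, the Hive metastore and the upload-id counter, the merge key `id` is invented for illustration, and the ALTER TABLE string uses Hive's parenthesized partition spec.

```python
# In-memory stand-ins (assumptions) for S3, the Hive metastore and the upload-id counter.
storage = {"/20180314_26": [{"id": 1, "clicks": 10}, {"id": 2, "clicks": 20}]}
metastore = {"20180314": "/20180314_26"}  # partition value -> current location
upload_counter = {"table_name": 26}


def atomic_update(table, date, incoming, key="id"):
    """Apply `incoming` rows to one partition and return the Hive statement that repoints it."""
    # 1) Read and increment the table upload id.
    upload_counter[table] += 1
    upload_id = upload_counter[table]
    # 2) `incoming` plays the role of the input file that has already been read.
    # 3) Read the "to be updated" partition from its current location.
    existing = storage.get(metastore.get(date), [])
    # 4) Merge the two datasets: incoming rows override existing rows with the same key.
    merged = {row[key]: row for row in existing}
    merged.update({row[key]: row for row in incoming})
    # 5) Write the partition out to a new location; the old location is left untouched.
    new_location = f"/{date}_{upload_id}"
    storage[new_location] = list(merged.values())
    # 6) Repoint the partition; readers see either the old location or the new one.
    statement = (
        f"ALTER TABLE {table} PARTITION (date='{date}') "
        f'SET LOCATION "{new_location}";'
    )
    metastore[date] = new_location
    return statement


def reclaim_stale():
    """7) Offline, periodic cleanup of locations that no partition points to anymore."""
    live = set(metastore.values())
    for location in [loc for loc in storage if loc not in live]:
        del storage[location]


stmt = atomic_update(
    "table_name", "20180314", [{"id": 2, "clicks": 25}, {"id": 3, "clicks": 5}]
)
```

The point of the design is that nothing is modified in place: the update only becomes visible when the partition location is switched in step 6, and the old data can be removed later without blocking readers.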
39. Important Notes
• Load / Query / Storage are completely decoupled
• Linear scale out
• L microservice is the driver program
– Single Spark context per microservice instance