Data preparation and transformation - Spin your straw into gold - Tel Aviv Summit 2018

© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Data preparation and transformation:
Spin your straw into gold
Daniel Haviv
Analytics Specialist Solutions Architect, Amazon Web Services
Raanan Raz
VP R&D, Datorama
Uri Sherman
Director of Data, Datorama

Agenda
Challenges & Requirements
Amazon EMR - Introduction
Customer Story - Datorama
AWS Glue - Introduction
Demo

[Architecture diagram: a pipeline with the stages Ingest, Store, Process & Analyze and Consume, split into a Speed Layer and a Batch Layer]
Components shown: Kinesis Data Streams, Kinesis Firehose Delivery Streams, DynamoDB, AWS Lambda, Kinesis Analytics, Raw Bucket, Parquet Bucket, Athena, Redshift Spectrum, QuickSight, Glue Data Catalog, Spark/EMR, Glue ETL, Real time Web UI

Challenges & Requirements
Data
• Too complex for human use
• Raw data <> consumable data
• Different format and schema requirements for different teams
Tools
• Massive storage capabilities
• Massively parallel processing engine
• Ability to handle flexible (if any) schema
Skills
• In-depth knowledge and experience with the technology
• Operational effort (install/configure/maintain/upgrade)

Amazon EMR

[Diagram: Hadoop ecosystem applications (Pig, SQL) on Amazon EMR, with Amazon S3]

Why EMR?
• Automation
• Decouple
• Elastic
• Integration
• Low-cost
• Current

Why EMR? Automation
• EC2 provisioning
• Cluster setup
• Hadoop configuration
• Installing applications
• Job submission
• Monitoring and failure handling

Why EMR? Decouple Storage and Compute
• Persistent cluster – interactive queries (Spark-SQL | Presto)
• Transient cluster – batch jobs (X hours nightly), add/remove nodes
• Workload-specific clusters (different sizes, different versions)
• Amazon S3 as the storage shared by all clusters
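
The transient-cluster pattern can be sketched as a request for the EMR RunJobFlow API. This is only an illustration, not configuration shown in the talk: the function name, cluster name, script path, instance types and counts, and release label are all placeholders. The snippet only builds the request dictionary and sends nothing to AWS.

```python
def transient_cluster_request(name, script_s3_path, core_nodes=4):
    """Build keyword arguments for boto3's emr.run_job_flow (nothing is sent here).

    All concrete values (release label, instance types, roles) are assumptions
    chosen for illustration.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.12.0",  # assumed release label
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m4.xlarge",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m4.xlarge",
                 "InstanceCount": core_nodes},
            ],
            # Transient cluster: terminate once the last step has finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "nightly-batch",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         script_s3_path],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# Hypothetical bucket and script path.
request = transient_cluster_request(
    "nightly-etl", "s3://example-bucket/jobs/etl.py", core_nodes=2)
```

With credentials in place, the request could be submitted with `boto3.client("emr").run_job_flow(**request)`. Because input and output live in Amazon S3, nothing is lost when the cluster terminates.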

Why EMR? – Compute Flexibility
• Compute (C4 Family, C5 Family): Machine Learning
• Memory (X1 Family, R3 Family): Interactive Analysis
• Storage (D2 Family, I3 Family): Large HDFS
• General (M4 Family): Batch Process

Why EMR? Elastic
• Scale out
• Scale in
• Auto scale

Why EMR? Current
Application    Open source release    EMR release
Spark 1.5      September 9, 2015      September 2015
Spark 1.5.2    November 9, 2015       November 2015
Spark 1.6      January 4, 2016        January 2016
Spark 1.6.1    March 9, 2016          April 4, 2016
Spark 2.0      July 26, 2016          August 2, 2016
Spark 2.0.2    November 14, 2016      November 21, 2016
Spark 2.1.0    December 28, 2016      January 26, 2017
Spark 2.2.0    July 11, 2017          August 10, 2017
Spark 2.2.1    December 1, 2017       January 22, 2018
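
To make the "current" claim concrete, the lag between each open source release and the matching EMR release can be computed from the dates listed. The sketch below uses only the rows that give a full EMR date; rows with only a month are left out rather than guessed.

```python
from datetime import date

# (Spark version, open source release date, EMR release date), taken from the
# table in this slide. Rows whose EMR date is only a month are omitted.
RELEASES = [
    ("1.6.1", date(2016, 3, 9), date(2016, 4, 4)),
    ("2.0", date(2016, 7, 26), date(2016, 8, 2)),
    ("2.0.2", date(2016, 11, 14), date(2016, 11, 21)),
    ("2.1.0", date(2016, 12, 28), date(2017, 1, 26)),
    ("2.2.0", date(2017, 7, 11), date(2017, 8, 10)),
    ("2.2.1", date(2017, 12, 1), date(2018, 1, 22)),
]


def lag_days(releases):
    """Return {version: days between open source release and EMR release}."""
    return {version: (emr - oss).days for version, oss, emr in releases}


lags = lag_days(RELEASES)
for version, days in lags.items():
    print(f"Spark {version}: {days} days")
```

For these six rows the lag ranges from 7 days (Spark 2.0 and 2.0.2) to 52 days (Spark 2.2.1).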

Why EMR? Low-cost
• Spot instances
• Transient clusters
• Reserved instances

Cost & Time
[Chart: # CPUs against time for two runs, one with a wall clock time of 1 hour and one with a wall clock time of 10 hours]
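
The arithmetic behind the two wall clock times can be sketched as follows. The sketch assumes the job scales perfectly linearly with the number of nodes and that nodes are billed per node-hour; the price and the amount of work are made-up numbers, not figures from the talk.

```python
def wall_clock_hours(total_node_hours, nodes):
    """Wall clock time for a job, assuming perfectly linear scaling."""
    return total_node_hours / nodes


def cluster_cost(nodes, hours, price_per_node_hour):
    """Cost of running `nodes` nodes for `hours` hours, billed per node-hour."""
    return nodes * hours * price_per_node_hour


PRICE = 0.25   # hypothetical price per node-hour
WORK = 10      # hypothetical job size in node-hours

small_hours = wall_clock_hours(WORK, nodes=1)    # 10 hours on 1 node
large_hours = wall_clock_hours(WORK, nodes=10)   # 1 hour on 10 nodes

small_cost = cluster_cost(1, small_hours, PRICE)
large_cost = cluster_cost(10, large_hours, PRICE)
print(small_hours, small_cost)
print(large_hours, large_cost)
```

Under these assumptions both runs cost the same, while the larger cluster finishes ten times sooner.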

Datorama

Datorama: Marketing Intelligence
Raanan Raz, VP R&D
Uri Sherman, Director of Data
March 14, 2018

• Founded in 2012 by Ran Sarig, Efi Cohen & Katrin Ribant
• +3000 brands
• +300 agencies
• +50 publishers
• +25 industry verticals
• 16 offices worldwide
• $50M funding from Lightspeed, Innovation Endeavors
• 300+ employees & growing quickly

Datorama is Intelligence for Marketing
Every performance, outcome & investment across the customer journey – all in one place.
So you can make smarter decisions.

Clients from around the world
• +300 agencies
• +3000 brands
• +20 verticals
• +25 verticals

Growth in numbers
• ~400 servers
• 4 geo locations
• >40 microservices
• ~1 PB raw data
• 5B daily events processed
• 5M daily analytical queries

It’s a fragmented marketing world
[Diagram: a hierarchy fanning out from MARKETER to BRAND, REGION, CAMPAIGN and CHANNEL, labeled Performance, Impact, Loyalty+ and Growth]

and it’s not stopping… (+5000 Platforms)

Transformations at scale
• Extract and Transform data
• Calculated columns
• Vlookups/Fuzzy match
• Complex logic and iterations
• Sandboxed environment
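
A few of these transformation types can be illustrated on plain Python dictionaries. This is a toy sketch, not Datorama's implementation: the column names, the lookup table and the 0.8 similarity cutoff are invented for the example, and fuzzy matching here relies on the standard library's `difflib`.

```python
import difflib


def add_calculated_column(rows, name, fn):
    """Return new rows with an extra column computed from each row."""
    return [{**row, name: fn(row)} for row in rows]


def vlookup(rows, key, table, target, default=None):
    """Return new rows with `target` filled from `table`, looked up by row[key]."""
    return [{**row, target: table.get(row[key], default)} for row in rows]


def fuzzy_lookup(value, candidates, cutoff=0.8):
    """Return the closest candidate to `value`, or None if nothing is close enough."""
    matches = difflib.get_close_matches(value, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical input rows; "Facebok" is a deliberate misspelling.
rows = [
    {"campaign": "spring_sale", "channel": "Facebok", "clicks": 120, "cost": 60.0},
    {"campaign": "spring_sale", "channel": "Google Ads", "clicks": 200, "cost": 50.0},
]
channel_owner = {"Facebook": "social team", "Google Ads": "search team"}

# Fuzzy match: normalize channel names against the known channels.
rows = [
    {**r, "channel": fuzzy_lookup(r["channel"], list(channel_owner)) or r["channel"]}
    for r in rows
]
# Calculated column: cost per click.
rows = add_calculated_column(rows, "cpc", lambda r: r["cost"] / r["clicks"])
# Vlookup: attach the owning team for each channel.
rows = vlookup(rows, "channel", channel_owner, "owner")
```

Each helper returns new rows instead of mutating its input, which keeps the steps easy to chain and to test in isolation.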

Marketing data is NOT immutable
• External vendors have reconciliation windows (up to 6 months)
• Our users want to update/delete specific rows or sets of rows
• Our users love to backdate
• Most (if not all) big data solutions are append-only, and updating the data is considered a heavy process

Product Scope
• Fact data
• Batch uploads (up to 1 billion rows per file)
  – Updates existing data, transactional, high throughput
• Interactive SQL queries
• Highly scalable
[Diagram: CSV uploads and SQL queries around an open "?"]

Architecture Overview
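[Diagram: an upload-req reaches the Loader Service and a query reaches the Query Service, which returns a result-set; each service is paired with EMR. Data goes from csv to orc on Amazon S3, with a Hive Metastore.]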

Storage Layout - Date Partitioning
Amazon S3
/my_table
    /20140313
        /part01.orc
        /part02.orc
        /part03.orc
    /20140314
        /part01.orc
        /part02.orc

Atomic Update Flow
1. Read and increment table upload id
2. Read input file
3. Read “to be updated” partitions from S3
4. Merge the two dataframes
5. Write out partitions to new locations, e.g. /20180314_27
6. Update Hive:
   ALTER TABLE table_name PARTITION (date='20180314') SET LOCATION "/20180314_27";
7. Reclaim stale data offline, periodically
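
The seven steps can be traced with a small in-memory model. This is a sketch under stated assumptions, not the Loader Service's code: plain dictionaries stand in for Amazon S3 and the Hive Metastore, rows are assumed to carry a `date` partition column and a `key` column, and incoming rows are assumed to overwrite existing rows with the same key. None of these details come from the talk.

```python
class Table:
    """In-memory stand-in for one table: storage locations plus partition pointers."""

    def __init__(self):
        self.upload_id = 0
        self.storage = {}    # location -> {row key: row}; stands in for S3
        self.metastore = {}  # partition date -> current location; stands in for Hive

    def read_partition(self, day):
        """Return a copy of the rows the metastore currently points to for `day`."""
        location = self.metastore.get(day)
        return dict(self.storage.get(location, {}))

    def load(self, input_rows):
        # 1. Read and increment the table upload id.
        self.upload_id += 1
        # 2. Read the input file (here: a list of dicts), grouped by partition.
        incoming = {}
        for row in input_rows:
            incoming.setdefault(row["date"], {})[row["key"]] = row
        new_locations = {}
        for day, new_rows in incoming.items():
            # 3. Read the "to be updated" partition.
            merged = self.read_partition(day)
            # 4. Merge: incoming rows replace existing rows with the same key.
            merged.update(new_rows)
            # 5. Write the merged partition to a new location.
            location = f"/{day}_{self.upload_id}"
            self.storage[location] = merged
            new_locations[day] = location
        # 6. Repoint the partitions; until now readers saw only the old data.
        self.metastore.update(new_locations)
        return new_locations

    def reclaim_stale(self):
        # 7. Offline cleanup: drop locations that no partition points to.
        live = set(self.metastore.values())
        stale = [loc for loc in self.storage if loc not in live]
        for loc in stale:
            del self.storage[loc]
        return stale


table = Table()
table.load([{"date": "20180314", "key": "a", "clicks": 1}])
table.load([{"date": "20180314", "key": "a", "clicks": 5},
            {"date": "20180314", "key": "b", "clicks": 2}])
current = table.read_partition("20180314")
```

Writing to a fresh location and switching the pointer last means a query never observes a half-written partition; the previous location stays readable until the periodic cleanup removes it.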

Important Notes
• Load / Query / Storage are completely decoupled
• Linear scale out
• L microservice is the driver program
  – Single Spark context per microservice instance

We’re Hiring!
Contact us at
https://datorama.com/join-us
https://engineering.datorama.com/

Thank You!
Raanan Raz, VP R&D
raanan@datorama.com
Uri Sherman, Director of Data
uri@datorama.com

Serverless ETL

AWS Glue – Overview
Data Catalog
• Hive Metastore compatible with enhanced functionality
• Crawlers automatically extract metadata and create tables
• Integrated with Amazon Athena, Amazon Redshift Spectrum
Job Execution
• Run jobs on a serverless Spark platform
• Provides flexible scheduling
• Handles dependency resolution, monitoring and alerting
Job Authoring
• Auto-generates ETL code
• Built on open frameworks – Spark: Python & Scala
• Developer Endpoint with Interactive Notebook
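
As a rough illustration of what "extract metadata and create tables" involves, the sketch below infers column types from a CSV sample. It is a toy model of schema inference in general: the talk does not describe how Glue crawlers classify data, and the type names, function names and sample data here are assumptions.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest of bigint, double, string that fits every non-empty value."""
    def fits(cast):
        for value in values:
            if value == "":
                continue  # treat empty cells as missing values
            try:
                cast(value)
            except ValueError:
                return False
        return True

    if fits(int):
        return "bigint"
    if fits(float):
        return "double"
    return "string"


def infer_schema(csv_text):
    """Return {column name: inferred type} for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {name: infer_type([row[name] for row in rows])
            for name in reader.fieldnames}


# Hypothetical sample file; the last row has a missing cost.
sample = (
    "date,channel,clicks,cost\n"
    "20180314,search,120,60.5\n"
    "20180315,social,80,\n"
)
schema = infer_schema(sample)
```

A catalog entry built this way is only as good as the sample it saw, which is one reason to re-run inference when upstream formats change.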

AWS Glue – Developer Endpoint
Explore, visualize and develop using a personal, serverless environment with interactive REPL and Notebooks.

Move data across storage systems
Unified view

Demo

dhaviv@amazon.com
