© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Data preparation and transformation:
Spin your straw into gold
Daniel Haviv
Analytics Specialist Solutions Architect, Amazon Web Services
Raanan Raz
VP R&D, Datorama
Uri Sherman
Director of Data, Datorama
Agenda
Challenges & Requirements
Amazon EMR - Introduction
Customer Story - Datorama
AWS Glue - Introduction
Demo
Architecture diagram
Stages: Ingest, Store, Process & Analyze, Consume
Layers: Speed Layer, Batch Layer
Components: Kinesis Data Streams, Kinesis Firehose Delivery Streams, Kinesis Analytics, AWS Lambda, DynamoDB, Raw Bucket, Parquet Bucket, Glue Data Catalog, Spark/EMR, Glue ETL, Athena, Redshift Spectrum, QuickSight, Real-time Web UI
Challenges & Requirements
Data
• Massive storage capabilities
• Massive Parallel Processing engine
• Ability to handle flexible (if any) schema
Tools
• In-depth knowledge and experience with the technology
• Operational effort (install/configure/maintain/upgrade)
Skills
• Too complex for human use
• Raw data <> consumable data
• Different format and schema requirements for different teams
Amazon EMR
Diagram labels: Hadoop ecosystem (Pig, SQL), Amazon EMR, Amazon S3
Why EMR?
Automation Decouple Elastic
Integration Low-cost Current
Why EMR? Automation
• EC2 provisioning
• Cluster setup
• Hadoop configuration
• Installing applications
• Job submission
• Monitoring and failure handling
Why EMR? Decouple Storage and Compute
Persistent Cluster – Interactive Queries
(Spark-SQL | Presto)
Transient Cluster - Batch Jobs
(X hours nightly) – Add/Remove Nodes
Workload specific clusters
(Different sizes, Different Versions)
Amazon S3
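A transient batch cluster can be described in code. The sketch below only builds the keyword arguments for boto3's EMR `run_job_flow` call and does not contact AWS. The function name, cluster name, instance types and counts, release label, bucket paths and role names are illustrative assumptions, not values from the talk.

```python
def transient_cluster_request(name, release_label, step_args, log_uri):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    Nothing here talks to AWS; the dict is only a request description.
    All concrete values below are illustrative assumptions.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m4.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m4.xlarge", "InstanceCount": 4},
            ],
            # False makes the cluster transient: it terminates after the
            # last step, while input and output data stay in Amazon S3.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "nightly-batch",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": list(step_args),
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = transient_cluster_request(
    name="nightly-etl",                      # hypothetical
    release_label="emr-5.12.0",              # hypothetical
    step_args=["spark-submit", "s3://example-bucket/jobs/etl.py"],
    log_uri="s3://example-bucket/emr-logs/",
)
# To launch for real: boto3.client("emr").run_job_flow(**request)
```

Because storage lives in Amazon S3 rather than on the cluster, a persistent interactive cluster would differ mainly in setting `KeepJobFlowAliveWhenNoSteps` to `True`.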
Why EMR? – Compute Flexibility
Category   Instance families       Typical workload
General    M4 Family               Batch Process
Compute    C4 Family, C5 Family    Machine Learning
Memory     X1 Family, R3 Family    Interactive Analysis
Storage    D2 Family, I3 Family    Large HDFS
Why EMR? Elastic
Scale Out Scale In
Auto Scale
Why EMR? Current
Application    Open source release    EMR release
Spark 1.5      September 9, 2015      September 2015
Spark 1.5.2    November 9, 2015       November 2015
Spark 1.6      January 4, 2016        January 2016
Spark 1.6.1    March 9, 2016          April 4, 2016
Spark 2.0      July 26, 2016          August 2, 2016
Spark 2.0.2    November 14, 2016      November 21, 2016
Spark 2.1.0    December 28, 2016      January 26, 2017
Spark 2.2.0    July 11, 2017          August 10, 2017
Spark 2.2.1    December 01, 2017      January 22, 2018
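The lag between an open source release and the matching EMR release can be computed for the rows that give exact dates on both sides. This is a small illustrative calculation, not part of the original slides. Rows with month-only EMR dates are left out, and the names `RELEASES` and `release_lag_days` are made up for this sketch.

```python
from datetime import date

# (application, open source release, EMR release), copied from the table.
RELEASES = [
    ("Spark 1.6.1", date(2016, 3, 9), date(2016, 4, 4)),
    ("Spark 2.0", date(2016, 7, 26), date(2016, 8, 2)),
    ("Spark 2.0.2", date(2016, 11, 14), date(2016, 11, 21)),
    ("Spark 2.1.0", date(2016, 12, 28), date(2017, 1, 26)),
    ("Spark 2.2.0", date(2017, 7, 11), date(2017, 8, 10)),
    ("Spark 2.2.1", date(2017, 12, 1), date(2018, 1, 22)),
]


def release_lag_days(releases):
    """Map each application to the days between the two release dates."""
    return {name: (emr - oss).days for name, oss, emr in releases}


lags = release_lag_days(RELEASES)
for name, days in lags.items():
    print(f"{name}: {days} days")
```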
Why EMR? Low-cost
Spot instances, Transient clusters, Reserved instances
Cost & Time
Two charts of # CPUs against Time: one with a wall clock time of 1 hour, one with a wall clock time of 10 hours
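The arithmetic behind the two wall clock times can be sketched as follows. It rests on two assumptions that the slide does not state: instances are billed per instance-hour, and the job scales linearly with the number of instances. The price and the instance counts are hypothetical.

```python
def cluster_cost(instances, hours, price_per_instance_hour):
    """Cost of running `instances` machines for `hours` hours."""
    return instances * hours * price_per_instance_hour


PRICE = 0.25  # hypothetical price per instance-hour

# Same total work, spread over fewer or more machines.
small_cluster = cluster_cost(instances=10, hours=10, price_per_instance_hour=PRICE)
large_cluster = cluster_cost(instances=100, hours=1, price_per_instance_hour=PRICE)

print(small_cluster, large_cluster)
```

Under these assumptions both runs consume 100 instance-hours, so the cost is the same and only the wall clock time differs.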
Datorama
Datorama: Marketing Intelligence
Raanan Raz, VP R&D
Uri Sherman, Director of Data
March 14, 2018
• Founded in 2012 by Ran Sarig, Efi Cohen & Katrin Ribant
• +3000 brands
• +300 agencies
• +50 publishers
• +25 industry verticals
• 16 offices worldwide
• $50M funding from Lightspeed and Innovation Endeavors
• 300+ employees & growing quickly
Datorama is Intelligence for Marketing
Every performance, outcome & investment across the customer journey – all in one place.
So you can make smarter decisions.
Clients from around the world
• +300 agencies
• +3000 brands
• +20 verticals
Growth in numbers
• +25 verticals
• ~400 servers
• 4 geo locations
• >40 microservices
• ~1 PB (1 petabyte) of raw data
• 5B daily events processed
• 5M daily analytical queries
Challenges
It’s a fragmented marketing world
Diagram: a MARKETER (Performance, Impact, Loyalty+, Growth) above a tree that fans out into BRANDs, REGIONs, CAMPAIGNs and CHANNELs
and it’s not stopping… (+5000 Platforms)
It’s being pushed around
Data is unstructured
Transformations at scale
• Extract and Transform data
• Calculated columns
• Vlookups/Fuzzy match
• Complex logic and iterations
• Sandboxed environment
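A calculated column backed by a fuzzy lookup might look like the sketch below. It uses Python's standard `difflib` purely for illustration. The slides do not say which matching technique Datorama uses, and the lookup table, column names and cutoff are invented for the example.

```python
import difflib

# Hypothetical lookup table: platform name -> channel type.
CHANNELS = {"facebook": "Social", "google ads": "Search", "twitter": "Social"}


def fuzzy_vlookup(key, table, cutoff=0.8):
    """Return the value whose key is the closest match to `key`, or None."""
    normalized = key.strip().lower()
    matches = difflib.get_close_matches(normalized, list(table), n=1, cutoff=cutoff)
    return table[matches[0]] if matches else None


rows = [
    {"platform": "Facebok", "spend": 120.0},      # typo in the source data
    {"platform": "Google Ads ", "spend": 80.0},   # stray whitespace
    {"platform": "Unknown Vendor", "spend": 5.0},
]
for row in rows:
    # Calculated column derived from the lookup.
    row["channel_type"] = fuzzy_vlookup(row["platform"], CHANNELS)
```

Rows with no sufficiently close key get `None`, which leaves the decision about unmatched values to the caller.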
Marketing data is NOT immutable
• External vendors have windows of reconciliations (up to 6 months)
• Our users want to update/delete specific rows/sets
• Our users love to backdate
• Most (if not all) big data solutions are append-only, and updating the data is considered a heavy process
The Solution
Product Scope
• Fact data
• Batch uploads (up to 1 billion rows per file)
– Updates existing data, Transactional, High throughput
• Interactive SQL queries
• Highly scalable
Diagram labels: CSV, CSV, CSV, ?, SQL, SQL, SQL
Architecture Overview
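Diagram: a Loader Service and a Query Service, each with its own EMR cluster, sharing the Hive Metastore and Amazon S3. The Loader Service receives upload requests and turns csv into orc; the Query Service receives queries and returns result sets.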
Storage Layout - Date Partitioning
/my_table
    /20140313
        /part01.orc
        /part02.orc
        /part03.orc
    /20140314
        /part01.orc
        /part02.orc
Atomic Update Flow
1. Read and increment the table upload id
2. Read the input file
3. Read the "to be updated" partitions from Amazon S3
4. Merge the two dataframes
5. Write out the partitions to new locations, e.g. /20180314_27
6. Update Hive: ALTER TABLE table_name PARTITION (date='20180314') SET LOCATION '/20180314_27';
7. Reclaim stale data offline, periodically
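The flow can be sketched in plain Python, with dictionaries standing in for Amazon S3 and the Hive Metastore. This is a minimal illustration under stated assumptions, not Datorama's implementation. The real system uses Spark dataframes, and the keyed "incoming row wins" merge rule, the `id` and `clicks` columns, and the function names are invented for the example.

```python
def atomic_update(table, upload_rows, key="id"):
    """Apply one upload to `table`; return the locations that became stale."""
    # 1. Read and increment the table upload id.
    table["upload_id"] += 1
    upload_id = table["upload_id"]

    # 2. Read the input file, grouped by date partition.
    incoming = {}
    for row in upload_rows:
        incoming.setdefault(row["date"], []).append(row)

    written = {}
    for part, new_rows in incoming.items():
        # 3. Read the "to be updated" partition from storage.
        old_location = table["metastore"].get(part)
        old_rows = table["storage"].get(old_location, [])
        # 4. Merge (assumed rule: incoming rows replace rows with the same key).
        merged = {r[key]: r for r in old_rows}
        merged.update({r[key]: r for r in new_rows})
        # 5. Write the partition to a new location; readers still see the old one.
        new_location = f"/{part}_{upload_id}"
        table["storage"][new_location] = list(merged.values())
        written[part] = new_location

    stale = []
    for part, new_location in written.items():
        # 6. Repoint the metastore entry, as ALTER TABLE ... SET LOCATION would.
        old_location = table["metastore"].get(part)
        table["metastore"][part] = new_location
        if old_location is not None:
            stale.append(old_location)
    return stale


def reclaim(table, stale_locations):
    # 7. Remove data that no partition points to any more.
    for location in stale_locations:
        table["storage"].pop(location, None)


table = {
    "upload_id": 26,
    "metastore": {"20180314": "/20180314_26"},
    "storage": {"/20180314_26": [
        {"id": 1, "date": "20180314", "clicks": 10},
        {"id": 2, "date": "20180314", "clicks": 7},
    ]},
}
stale = atomic_update(table, [
    {"id": 2, "date": "20180314", "clicks": 9},   # update to an existing row
    {"id": 3, "date": "20180314", "clicks": 4},   # new row
])
```

Old data is never modified in place. A query sees either the previous location or the new one, and the stale location remains readable until `reclaim` runs.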
Important Notes
• Load / Query / Storage are completely decoupled
• Linear scale out
• L microservice is the driver program
– Single spark context per microservice instance
Contact us at
https://datorama.com/join-us
https://engineering.datorama.com/
We’re Hiring!
Raanan Raz, VP R&D
raanan@datorama.com
Uri Sherman, Director of Data
uri@datorama.com
Thank You!
Serverless ETL
AWS Glue – Overview
Data Catalog
• Hive Metastore compatible with enhanced functionality
• Crawlers automatically extract metadata and create tables
• Integrated with Amazon Athena, Amazon Redshift Spectrum
Job Execution
• Run jobs on a serverless Spark platform
• Provides flexible scheduling
• Handles dependency resolution, monitoring and alerting
Job Authoring
• Auto-generates ETL code
• Build on open frameworks – Spark: Python & Scala
• Developer Endpoint with Interactive Notebook
AWS Glue – Developer Endpoint
Explore, visualize and develop using a personal, serverless environment with interactive REPL and Notebooks.
Move data across storage systems
Unified view
Demo
dhaviv@amazon.com

Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Data preparation and transformation - Spin your straw into gold - Tel Aviv Summit 2018

  • 1. Data preparation and transformation: Spin your straw into gold. Daniel Haviv, Analytics Specialist Solutions Architect, Amazon Web Services; Raanan Raz, VP R&D, Datorama; Uri Sherman, Director of Data, Datorama. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
  • 2. Agenda: Challenges & Requirements; Amazon EMR - Introduction; Customer Story - Datorama; AWS Glue - Introduction; Demo
  • 3. [Architecture diagram. Stages: Ingest, Store, Process & Analyze, Consume. Layers: Speed Layer, Batch Layer. Components: Kinesis Data Streams, Kinesis Firehose Delivery Streams, DynamoDB, AWS Lambda, Kinesis Analytics, Raw Bucket, Parquet Bucket, Athena, Redshift Spectrum, QuickSight, Glue Data Catalog, Spark/EMR, Glue ETL, Real time Web UI]
  • 4. Challenges & Requirements. Data: massive storage capabilities; massive parallel processing engine; ability to handle flexible (if any) schema. Tools: in-depth knowledge and experience with the technology; operational effort (install/configure/maintain/upgrade). Skills: too complex for human use; raw data <> consumable data; different format and schema requirements for different teams.
  • 5. Amazon EMR
  • 6. PIG; SQL; Amazon EMR; Amazon S3; Hadoop ecosystem
  • 7. Why EMR? Automation; Decouple; Elastic; Integration; Low-cost; Current
  • 8. Why EMR? Automation: EC2 provisioning; cluster setup; Hadoop configuration; installing applications; job submission; monitoring and failure handling
  • 9. Why EMR? Automation; Decouple; Elastic; Integration; Low-cost; Current
  • 10. Why EMR? Decouple Storage and Compute. Persistent cluster: interactive queries (Spark-SQL | Presto). Transient cluster: batch jobs (X hours nightly), add/remove nodes. Workload-specific clusters (different sizes, different versions). Amazon S3.
  • 11. Why EMR? – Compute Flexibility. Compute (C4 family, C5 family): machine learning. Memory (X1 family, R3 family): interactive analysis. Storage (D2 family, I3 family): large HDFS. General (M4 family): batch processing.
  • 12. Why EMR? Automation; Decouple; Elastic; Integration; Low-cost; Current
  • 13. Why EMR? Elastic: scale out; scale in; auto scale
  • 14. Why EMR? Automation; Decouple; Elastic; Integration; Low-cost; Current
  • 15. Why EMR? Automation; Decouple; Elastic; Integration; Low-cost; Current
  • 16. Why EMR? Current
      Application | Open source release | EMR release
      Spark 1.5 | September 9, 2015 | September 2015
      Spark 1.5.2 | November 9, 2015 | November 2015
      Spark 1.6 | January 4, 2016 | January 2016
      Spark 1.6.1 | March 9, 2016 | April 4, 2016
      Spark 2.0 | July 26, 2016 | August 2, 2016
      Spark 2.0.2 | November 14, 2016 | November 21, 2016
      Spark 2.1.0 | December 28, 2016 | January 26, 2017
      Spark 2.2.0 | July 11, 2017 | August 10, 2017
      Spark 2.2.1 | December 1, 2017 | January 22, 2018
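The gap between each open source release and the matching EMR release on slide 16 can be worked out directly. The sketch below is only an illustration: the dates are copied from the slide, rows where the slide gives just a month for the EMR release are left out, and the names `RELEASES` and `release_lag_days` are made up for this example.

```python
from datetime import date

# (Spark version, open source release date, EMR release date), copied from slide 16.
# Rows where the slide gives only a month for the EMR release are omitted.
RELEASES = [
    ("1.6.1", date(2016, 3, 9), date(2016, 4, 4)),
    ("2.0", date(2016, 7, 26), date(2016, 8, 2)),
    ("2.0.2", date(2016, 11, 14), date(2016, 11, 21)),
    ("2.1.0", date(2016, 12, 28), date(2017, 1, 26)),
    ("2.2.0", date(2017, 7, 11), date(2017, 8, 10)),
    ("2.2.1", date(2017, 12, 1), date(2018, 1, 22)),
]


def release_lag_days(releases):
    """Return {version: days between the open source and EMR releases}."""
    return {version: (emr - oss).days for version, oss, emr in releases}


if __name__ == "__main__":
    for version, lag in release_lag_days(RELEASES).items():
        print(f"Spark {version}: {lag} days")
```

For the rows with full dates, the lag ranges from 7 days (Spark 2.0 and 2.0.2) to 52 days (Spark 2.2.1).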
  • 18. Why EMR? Automation; Decouple; Elastic; Integration; Low-cost; Current
  • 19. Why EMR? Low-cost: Spot instances; transient clusters; Reserved instances
  • 20. Cost & Time. [Two charts of # CPUs against time: one with a wall clock time of 1 hour, one with a wall clock time of 10 hours]
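Slide 20 gives only the two wall clock times, so the arithmetic behind it has to be assumed. A likely reading is that a job needing a fixed number of CPU-hours costs the same whether it runs on few CPUs for a long time or on many CPUs for a short time. The sketch below assumes a perfectly parallel workload and per-CPU-hour billing; the CPU counts and the price are hypothetical.

```python
def wall_clock_hours(total_cpu_hours, cpus):
    """Wall clock time for a job, assuming it parallelises perfectly."""
    return total_cpu_hours / cpus


def job_cost(cpus, hours, price_per_cpu_hour):
    """Cost of keeping `cpus` CPUs busy for `hours`, billed per CPU-hour."""
    return cpus * hours * price_per_cpu_hour


if __name__ == "__main__":
    total_cpu_hours = 10   # hypothetical job size
    price = 0.05           # hypothetical price per CPU-hour
    for cpus in (1, 10):
        hours = wall_clock_hours(total_cpu_hours, cpus)
        cost = job_cost(cpus, hours, price)
        print(f"{cpus} CPUs: {hours} h wall clock, cost {cost}")
```

Under these assumptions the bill is identical in both cases, and only the wall clock time changes.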
  • 21. Datorama
  • 22. Datorama: Marketing Intelligence. Raanan Raz, VP R&D; Uri Sherman, Director of Data. March 14, 2018
  • 23. Founded in 2012 by Ran Sarig, Efi Cohen & Katrin Ribant. +3000 brands; +300 agencies; +50 publishers; +25 industry verticals; 16 offices worldwide; $50M funding from Lightspeed and Innovation Endeavors; 300+ employees & growing quickly
  • 24. Datorama is Intelligence for Marketing Every performance, outcome & investment across the customer journey – all in one place. So you can make smarter decisions.
  • 26. Growth in numbers: ~400 servers; 4 geo locations; >40 microservices; ~1 PB (1 petabyte) raw data; 5B daily events processed; 5M daily analytical queries
  • 28. It’s a fragmented marketing world. [Diagram: one marketer (labels: Performance, Impact, Loyalty, Growth) over a hierarchy of 2 brands, 4 regions, 8 campaigns and 16 channels]
  • 29. and it’s not stopping… (+5000 Platforms)
  • 32. Transformations at scale: extract and transform data; calculated columns; VLOOKUPs/fuzzy match; complex logic and iterations; sandboxed environment
  • 33. Marketing data is NOT immutable: external vendors have reconciliation windows (up to 6 months); our users want to update/delete specific rows or sets; our users love to backdate; most (if not all) big data solutions are append-only, and updating the data is considered a heavy process
  • 35. Product Scope: fact data; batch uploads (up to 1 billion rows per file) that update existing data, are transactional and have high throughput; interactive SQL queries; highly scalable. [Diagram labels: CSV, CSV, CSV, ?, SQL, SQL, SQL]
  • 36. Architecture Overview. [Diagram labels: EMR, EMR, Loader Service, Query Service, csv, orc, Hive Metastore, Amazon S3, upload-req, query, result-set]
  • 37. Storage Layout - Date Partitioning (Amazon S3)
      /my_table
        /20140313
          /part01.orc
          /part02.orc
          /part03.orc
        /20140314
          /part01.orc
          /part02.orc
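The date-partitioned layout on slide 37 maps a table, a day and a part number to an object key. A minimal sketch of that mapping follows; the function names are hypothetical, and the only details taken from the slide are the `/table/YYYYMMDD/partNN.orc` pattern.

```python
from datetime import date


def partition_prefix(table, day):
    """Key prefix holding every ORC part file for one day of one table."""
    return f"/{table}/{day.strftime('%Y%m%d')}"


def part_path(table, day, part_number):
    """Full key of a single ORC part file inside a date partition."""
    return f"{partition_prefix(table, day)}/part{part_number:02d}.orc"


if __name__ == "__main__":
    print(part_path("my_table", date(2014, 3, 13), 1))
    print(part_path("my_table", date(2014, 3, 14), 2))
```

Because every file for a given day shares one prefix, a date-filtered query only needs to list and read the matching prefixes.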
  • 38. Atomic Update Flow: (1) read and increment table upload id; (2) read input file; (3) read “to be updated” partitions from S3; (4) merge the two dataframes; (5) write out partitions to new locations, e.g. /20180314_27; (6) update Hive: ALTER TABLE table_name PARTITION (date='20180314') SET LOCATION "/20180314_27"; (7) reclaim stale data offline, periodically
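The seven steps on slide 38 can be sketched with plain dictionaries standing in for S3 and the Hive metastore. This is not Datorama's implementation: the `TableStore` class, its methods and the row format are hypothetical, and the dictionary update in step 6 merely plays the role of the `ALTER TABLE ... SET LOCATION` statement.

```python
class TableStore:
    """In-memory stand-in for S3 plus the Hive metastore (hypothetical)."""

    def __init__(self):
        self.upload_id = 0
        self.files = {}       # location -> {row key: row value}
        self.partitions = {}  # partition date -> location readers are pointed at

    def read(self, day):
        """Return the rows of the live version of one partition."""
        return self.files.get(self.partitions.get(day), {})

    def load(self, incoming):
        """Apply one upload, given as {partition date: {row key: row value}}."""
        self.upload_id += 1                         # 1. increment the upload id
        new_locations = {}
        for day, new_rows in incoming.items():      # 2. walk the input
            current = self.read(day)                # 3. read the partition to update
            merged = {**current, **new_rows}        # 4. merge, new rows win
            location = f"/{day}_{self.upload_id}"   # 5. write to a new location
            self.files[location] = merged
            new_locations[day] = location
        self.partitions.update(new_locations)       # 6. repoint the partitions

    def reclaim(self):
        """7. Delete locations that no partition points at any more."""
        live = set(self.partitions.values())
        stale = [loc for loc in self.files if loc not in live]
        for loc in stale:
            del self.files[loc]
        return stale


if __name__ == "__main__":
    store = TableStore()
    store.load({"20180314": {"a": 1, "b": 2}})
    store.load({"20180314": {"b": 20, "c": 3}})
    print(store.read("20180314"), store.partitions, store.reclaim())
```

Readers only ever see a location that has been fully written, since the old files stay in place until `reclaim` runs, which is why the cleanup in step 7 can happen offline.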
  • 39. Important Notes: Load / Query / Storage are completely decoupled; linear scale out; L microservice is the driver program, with a single Spark context per microservice instance
  • 41. Thank You! Raanan Raz, VP R&D, raanan@datorama.com; Uri Sherman, Director of Data, uri@datorama.com
  • 42. Serverless ETL
  • 43. AWS Glue – Overview. Data Catalog: Hive Metastore compatible with enhanced functionality; crawlers automatically extract metadata and create tables; integrated with Amazon Athena and Amazon Redshift Spectrum. Job Execution: run jobs on a serverless Spark platform; provides flexible scheduling; handles dependency resolution, monitoring and alerting. Job Authoring: auto-generates ETL code; built on open frameworks (Spark: Python & Scala); Developer Endpoint with interactive notebook.
  • 44. AWS Glue – Developer Endpoint: explore, visualize and develop using a personal, serverless environment with interactive REPL and notebooks.
  • 45. Move data across storage systems; unified view
  • 46. Demo
  • 47. dhaviv@amazon.com