SlideShare a Scribd company logo
1
April 2016 – AWS Loft, London
Gareth Rogers, Data Engineer
2
Metail lets you try on clothes online
Discover clothes on
your body shape
Create, save outfits
and share
Shop with confidence
of size and fit
3
Proven impact as validated by
American business schools and A/B tests
‘‘
…customers who had access to the fitting tool are more likely to come back to the
site, and this effect is statistically significant… ‘‘
…shows approximately a 5.1 percent reduction in returns compared to the
control group…In other words, providing fit information reduces average fulfilment costs”
…sales for users with access to the tool were substantially higher overall - 22.32 percent larger
‘‘
Source: “The Value of Fit Information in Online Retail: Evidence from a Randomized Field Experiment”
by Prof Santiago Gallino (Dartmouth College - Tuck School of Business) & Prof Antonio
Moreno (Northwestern University) –Oct 21, 2015
DATA
1000+ GARMENTS
POINTS3M
4
Architecture Theory
• Our architecture is modelled on Nathan Marz’s Lambda Architecture: http://lambda-
architecture.net
• Should include a speed layer to give a real time view on sampled data
– We’ve not implemented this
New
Data
Batch Layer
Master dataset
Serving Layer
Batch views
Query
Query
Query
5
Architecture Practice – Data Collection
6
Architecture Practice – Data Collection
New Data and Collection
• We’re using Snowplow for the initial stages of our pipeline
• Using their JavaScript tracker and Cloudfront collector
configuration
• Tracker performs a GET request on a Cloudfront distributed
image (pixel)
• Query parameters of the contain the event data e.g. GET
http://d2sgzneryst63x.cloudfront.net/i?e=pv&url=...&page=...&...
• Cloudfront configured to log the requests to S3
• We now have our master record
7
Architecture Practice – Serving Layer
Serving Layer• Initially queries over Hadoop  Redshift came along 
• RedshiftSQL good for small data science team!
• Not so good for everyone else in the company
• Introduced Looker 
• Data model in SQL
• Dashboards
• Point and click data exploration
• Permissions
• Version control
8
Architecture Practice – Batch Layer
• Daily process the raw events to create batch view
• Run using Elastic MapReduce (EMR) hosted Hadoop service in AWS
• Create views of the master record through enrichment and aggregation
• Populates the schema for speedy Redshift queries
Batch Layer
9
Extract Transform and Load (ETL)
• Snowplow’s ETL driven by config files executed in Ruby
– Initial step executed outside of EMR
– Copy data from Cloudfront incoming log bucket to another S3 bucket for processing
– Next create EMR cluster
10
Extract Transform and Load (ETL)
• Snowplow’s ETL driven by config files executed in Ruby
– Initial step executed outside of EMR
– Copy data from Cloudfront incoming log bucket to another S3 bucket for processing
– Next create EMR cluster
11
Extract Transform and Load (ETL)
• To that cluster we add steps
• Initial step use s3distcp to aggregate the log files
• Snowplow’s ETL written in Scalding
– Scalding = Cascading (Java higher level MapReduce libraries) in Scala
– They provide a compiled JAR hosted in S3
12
Extract Transform and Load (ETL)
• Metail’s ETL is very similar to
Snowplow’s
• Use AWS’ Data Pipeline to drive the
workflow
– Really great to get going
– But quickly hit complexity
limitations
13
Extract Transform and Load (ETL)
• Metail ETL written in
– Cascalog, logic programming over Hadoop
– Cascalog = Cascading + Datalog in Clojure
– Ridiculously compact and expressive
– But steep learning curve and impenetrable errors
14
Extract Transform and Load (ETL)
• Soon Parkour a Clojure wrapper over Hadoop Java API
– Access to full Hadoop API with no abstractions just more idiomatic Clojure
– Learning curve is mainly Hadoop
– Errors still impenetrable
15
Summary
• This pipeline has been built and managed by 3-5 people
• It’s about a year and a half old and continues to evolve
• Composed of a few different technologies and EMR used to do the batch processing
• Using EMR has made cluster managing and scaling straightforward
• The synergy between EMR and S3 is a powerful feature
– Encourages immutable infrastructure
– You don’t need your compute cluster running to hold your data!

More Related Content

What's hot

AWS_Data_Pipeline
AWS_Data_PipelineAWS_Data_Pipeline
AWS_Data_Pipeline
Ahasan Habib
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to One
Serg Masyutin
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Mammoth Data
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
Kriangkrai Chaonithi
 
Uber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache FlinkUber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache Flink
Wenrui Meng
 
Volta: Logging, Metrics, and Monitoring as a Service
Volta: Logging, Metrics, and Monitoring as a ServiceVolta: Logging, Metrics, and Monitoring as a Service
Volta: Logging, Metrics, and Monitoring as a Service
LN Renganarayana
 
A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big Data
Databricks
 
presto-at-netflix-hadoop-summit-15
presto-at-netflix-hadoop-summit-15presto-at-netflix-hadoop-summit-15
presto-at-netflix-hadoop-summit-15
Zhenxiao Luo
 
Scaling Graphite At Yelp
Scaling Graphite At YelpScaling Graphite At Yelp
Scaling Graphite At Yelp
Paul O'Connor
 
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
confluent
 
Distributed Data Quality - Technical Solutions for Organizational Scaling
Distributed Data Quality - Technical Solutions for Organizational ScalingDistributed Data Quality - Technical Solutions for Organizational Scaling
Distributed Data Quality - Technical Solutions for Organizational Scaling
Justin Cunningham
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
HostedbyConfluent
 
Spark Summit EU talk by Tug Grall
Spark Summit EU talk by Tug GrallSpark Summit EU talk by Tug Grall
Spark Summit EU talk by Tug Grall
Spark Summit
 
Spark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan RavatSpark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan Ravat
Spark Summit
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
hadooparchbook
 
Challenges in Building a Data Pipeline
Challenges in Building a Data PipelineChallenges in Building a Data Pipeline
Challenges in Building a Data Pipeline
Manish Kumar
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Spark Summit
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
huguk
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache Spark
DataWorks Summit
 
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Brandon O'Brien
 

What's hot (20)

AWS_Data_Pipeline
AWS_Data_PipelineAWS_Data_Pipeline
AWS_Data_Pipeline
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to One
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
 
Uber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache FlinkUber Business Metrics Generation and Management Through Apache Flink
Uber Business Metrics Generation and Management Through Apache Flink
 
Volta: Logging, Metrics, and Monitoring as a Service
Volta: Logging, Metrics, and Monitoring as a ServiceVolta: Logging, Metrics, and Monitoring as a Service
Volta: Logging, Metrics, and Monitoring as a Service
 
A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big Data
 
presto-at-netflix-hadoop-summit-15
presto-at-netflix-hadoop-summit-15presto-at-netflix-hadoop-summit-15
presto-at-netflix-hadoop-summit-15
 
Scaling Graphite At Yelp
Scaling Graphite At YelpScaling Graphite At Yelp
Scaling Graphite At Yelp
 
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
 
Distributed Data Quality - Technical Solutions for Organizational Scaling
Distributed Data Quality - Technical Solutions for Organizational ScalingDistributed Data Quality - Technical Solutions for Organizational Scaling
Distributed Data Quality - Technical Solutions for Organizational Scaling
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Spark Summit EU talk by Tug Grall
Spark Summit EU talk by Tug GrallSpark Summit EU talk by Tug Grall
Spark Summit EU talk by Tug Grall
 
Spark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan RavatSpark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan Ravat
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
 
Challenges in Building a Data Pipeline
Challenges in Building a Data PipelineChallenges in Building a Data Pipeline
Challenges in Building a Data Pipeline
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache Spark
 
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
 

Similar to Metail and Elastic MapReduce

Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, EuropePatterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, Europe
Flip Kromer
 
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauBig Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Sam Palani
 
Cloud
CloudCloud
An AMIS Overview of Oracle database 12c (12.1)
An AMIS Overview of Oracle database 12c (12.1)An AMIS Overview of Oracle database 12c (12.1)
An AMIS Overview of Oracle database 12c (12.1)
Marco Gralike
 
An AMIS overview of database 12c
An AMIS overview of database 12cAn AMIS overview of database 12c
Performance architecture for cloud connect
Performance architecture for cloud connectPerformance architecture for cloud connect
Performance architecture for cloud connect
Adrian Cockcroft
 
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionEnterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Dmitry Anoshin
 
Modern data warehouse presentation
Modern data warehouse presentationModern data warehouse presentation
Modern data warehouse presentation
David Rice
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Databricks
 
IRMAC April 2015 - DMBOK2 DWBI New Content
IRMAC April 2015 - DMBOK2 DWBI New ContentIRMAC April 2015 - DMBOK2 DWBI New Content
IRMAC April 2015 - DMBOK2 DWBI New Content
Martin Sykora
 
From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim Hunter
Databricks
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
AboutYouGmbH
 
Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...
Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...
Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...
SnapLogic
 
Agile data warehousing
Agile data warehousingAgile data warehousing
Agile data warehousing
Sneha Challa
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Anyscale
 
apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...
apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...
apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...
apidays
 
Alex mang patterns for scalability in microsoft azure application
Alex mang   patterns for scalability in microsoft azure applicationAlex mang   patterns for scalability in microsoft azure application
Alex mang patterns for scalability in microsoft azure application
Codecamp Romania
 
Cloud Big Data Architectures
Cloud Big Data ArchitecturesCloud Big Data Architectures
Cloud Big Data Architectures
Lynn Langit
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
Lucidworks
 
Java Developers, make the database work for you (NLJUG JFall 2010)
Java Developers, make the database work for you (NLJUG JFall 2010)Java Developers, make the database work for you (NLJUG JFall 2010)
Java Developers, make the database work for you (NLJUG JFall 2010)
Lucas Jellema
 

Similar to Metail and Elastic MapReduce (20)

Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, EuropePatterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, Europe
 
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauBig Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
 
Cloud
CloudCloud
Cloud
 
An AMIS Overview of Oracle database 12c (12.1)
An AMIS Overview of Oracle database 12c (12.1)An AMIS Overview of Oracle database 12c (12.1)
An AMIS Overview of Oracle database 12c (12.1)
 
An AMIS overview of database 12c
An AMIS overview of database 12cAn AMIS overview of database 12c
An AMIS overview of database 12c
 
Performance architecture for cloud connect
Performance architecture for cloud connectPerformance architecture for cloud connect
Performance architecture for cloud connect
 
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionEnterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
 
Modern data warehouse presentation
Modern data warehouse presentationModern data warehouse presentation
Modern data warehouse presentation
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
IRMAC April 2015 - DMBOK2 DWBI New Content
IRMAC April 2015 - DMBOK2 DWBI New ContentIRMAC April 2015 - DMBOK2 DWBI New Content
IRMAC April 2015 - DMBOK2 DWBI New Content
 
From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim Hunter
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
 
Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...
Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...
Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...
 
Agile data warehousing
Agile data warehousingAgile data warehousing
Agile data warehousing
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
 
apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...
apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...
apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...
 
Alex mang patterns for scalability in microsoft azure application
Alex mang   patterns for scalability in microsoft azure applicationAlex mang   patterns for scalability in microsoft azure application
Alex mang patterns for scalability in microsoft azure application
 
Cloud Big Data Architectures
Cloud Big Data ArchitecturesCloud Big Data Architectures
Cloud Big Data Architectures
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
 
Java Developers, make the database work for you (NLJUG JFall 2010)
Java Developers, make the database work for you (NLJUG JFall 2010)Java Developers, make the database work for you (NLJUG JFall 2010)
Java Developers, make the database work for you (NLJUG JFall 2010)
 

Recently uploaded

Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
TIPNGVN2
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Zilliz
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 

Recently uploaded (20)

Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 

Metail and Elastic MapReduce

  • 1. 1 April 2016 – AWS Loft, London Gareth Rogers, Data Engineer
  • 2. 2 Metail lets you try on clothes online Discover clothes on your body shape Create, save outfits and share Shop with confidence of size and fit
  • 3. 3 Proven impact as validated by American business schools and A/B tests ‘‘ …customers who had access to the fitting tool are more likely to come back to the site, and this effect is statistically significant… ‘‘ …shows approximately a 5.1 percent reduction in returns compared to the control group…In other words, providing fit information reduces average fulfilment costs” …sales for users with access to the tool were substantially higher overall - 22.32 percent larger ‘‘ Source: “The Value of Fit Information in Online Retail: Evidence from a Randomized Field Experiment” by Prof Santiago Gallino (Dartmouth College - Tuck School of Business) & Prof Antonio Moreno (Northwestern University) –Oct 21, 2015 DATA 1000+ GARMENTS POINTS3M
  • 4. 4 Architecture Theory • Our architecture is modelled on Nathan Marz’s Lambda Architecture: http://lambda- architecture.net • Should include a speed layer to give a real time view on sampled data – We’ve not implemented this New Data Batch Layer Master dataset Serving Layer Batch views Query Query Query
  • 5. 5 Architecture Practice – Data Collection
  • 6. 6 Architecture Practice – Data Collection New Data and Collection • We’re using Snowplow for the initial stages of our pipeline • Using their JavaScript tracker and Cloudfront collector configuration • Tracker performs a GET request on a Cloudfront distributed image (pixel) • Query parameters of the contain the event data e.g. GET http://d2sgzneryst63x.cloudfront.net/i?e=pv&url=...&page=...&... • Cloudfront configured to log the requests to S3 • We now have our master record
  • 7. 7 Architecture Practice – Serving Layer Serving Layer• Initially queries over Hadoop  Redshift came along  • RedshiftSQL good for small data science team! • Not so good for everyone else in the company • Introduced Looker  • Data model in SQL • Dashboards • Point and click data exploration • Permissions • Version control
  • 8. 8 Architecture Practice – Batch Layer • Daily process the raw events to create batch view • Run using Elastic MapReduce (EMR) hosted Hadoop service in AWS • Create views of the master record through enrichment and aggregation • Populates the schema for speedy Redshift queries Batch Layer
  • 9. 9 Extract Transform and Load (ETL) • Snowplow’s ETL driven by config files executed in Ruby – Initial step executed outside of EMR – Copy data from Cloudfront incoming log bucket to another S3 bucket for processing – Next create EMR cluster
  • 10. 10 Extract Transform and Load (ETL) • Snowplow’s ETL driven by config files executed in Ruby – Initial step executed outside of EMR – Copy data from Cloudfront incoming log bucket to another S3 bucket for processing – Next create EMR cluster
  • 11. 11 Extract Transform and Load (ETL) • To that cluster we add steps • Initial step use s3distcp to aggregate the log files • Snowplow’s ETL written in Scalding – Scalding = Cascading (Java higher level MapReduce libraries) in Scala – They provide a compiled JAR hosted in S3
  • 12. 12 Extract Transform and Load (ETL) • Metail’s ETL is very similar to Snowplow’s • Use AWS’ Data Pipeline to drive the workflow – Really great to get going – But quickly hit complexity limitations
  • 13. 13 Extract Transform and Load (ETL) • Metail ETL written in – Cascalog, logic programming over Hadoop – Cascalog = Cascading + Datalog in Clojure – Ridiculously compact and expressive – But steep learning curve and impenetrable errors
  • 14. 14 Extract Transform and Load (ETL) • Soon Parkour a Clojure wrapper over Hadoop Java API – Access to full Hadoop API with no abstractions just more idiomatic Clojure – Learning curve is mainly Hadoop – Errors still impenetrable
  • 15. 15 Summary • This pipeline has been built and managed by 3-5 people • It’s about a year and a half old and continues to evolve • Composed of a few different technologies and EMR used to do the batch processing • Using EMR has made cluster managing and scaling straightforward • The synergy between EMR and S3 is a powerful feature – Encourages immutable infrastructure – You don’t need your compute cluster running to hold your data!