The fashion shopping future
Metail's Data Pipeline and
AWS
OCTOBER 2015
Introduction
• Introduction to Metail (from BD shiny)
• Architecture Overview
• Event Tracking and Collection
• Extract Transform and Load (ETL)
• Getting Insights
• Managing The Pipeline
The Metail Experience allows customers to…
• Discover clothes on your body shape
• Create, save and share outfits
• Shop with confidence of size and fit
Size & scale
• 1.6m MeModels created
• 88 Countries
Architecture Overview
• Our architecture is modelled on Nathan Marz’s Lambda Architecture: http://lambda-architecture.net
New Data and Collection
[Diagram build-up: new data flows into the Batch Layer, which builds the views exposed by the Serving Layer]
Data Collection
• A significant part of our pipeline is powered by Snowplow: http://snowplowanalytics.com
• We use their technology for tracking and for setting up collection
– They have specified a tracking protocol and implemented it in many languages
– We’re using the JavaScript tracker
– The implementation is very similar to Google Analytics (GA): http://www.google.co.uk/analytics/
– But you have all the raw data :)
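Concretely, the tracker records an event by requesting a pixel with the payload in the query string. A minimal sketch (the collector domain is the one from the example below; the payload fields here are a small illustrative subset of the tracker protocol):

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Collector endpoint serving the 1x1 GIF; the real JavaScript tracker
# sets many more payload fields than shown here.
COLLECTOR = "http://d2sgzneryst63x.cloudfront.net/i"

def tracking_url(payload):
    """Build the GET URL the tracker requests to record one event."""
    return COLLECTOR + "?" + urlencode(payload)

url = tracking_url({"e": "pv", "url": "http://example.com/dress", "page": "Dress"})

# The collector never parses this; it just serves the pixel while
# CloudFront logs the request, query string included.
params = parse_qs(urlparse(url).query)
print(params["e"])  # ['pv']
```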
Data Collection
• Where does AWS come in?
– Snowplow CloudFront Collector: https://github.com/snowplow/snowplow/wiki/Setting-up-the-Cloudfront-collector
– We uploaded Snowplow’s tracking GIF, named i, to an S3 bucket
– CloudFront serves the content of the bucket
– To collect an event the tracker performs a GET request for the pixel
– The query parameters of the GET request contain the event payload
– E.g. GET http://d2sgzneryst63x.cloudfront.net/i?e=pv&url=...&page=...&...
– CloudFront is configured for HTTP and HTTPS, allowing only GET and HEAD, with logging enabled
– The CloudFront requests – i.e. the events – are logged to our S3 bucket :)
– In Lambda Architecture terms these CloudFront logs are our master record: the raw data
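Those access logs are tab-separated text with a `#Fields:` header naming the columns, so turning them back into events is a small parsing job. A sketch (the sample line is illustrative, not a real log entry; parsing against the header avoids hardcoding column positions):

```python
from urllib.parse import parse_qs

# Illustrative CloudFront-style access log: a '#Fields:' header naming
# the columns, then one tab-separated row per request.
LOG = """#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query
2015-10-01\t12:00:01\tLHR5\t43\t192.0.2.10\tGET\td2sgzneryst63x.cloudfront.net\t/i\t200\t-\tMozilla/5.0\te=pv&url=http%3A%2F%2Fexample.com&page=Dress
"""

def parse_events(log_text):
    """Yield one dict per logged request, decoding the event payload."""
    fields = []
    for line in log_text.splitlines():
        if line.startswith("#Fields:"):
            fields = line.split()[1:]
        elif line and not line.startswith("#"):
            row = dict(zip(fields, line.split("\t")))
            # The tracker's payload travels in the query string.
            row["payload"] = {k: v[0]
                              for k, v in parse_qs(row["cs-uri-query"]).items()}
            yield row

events = list(parse_events(LOG))
```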
Extract Transform and Load (ETL)
• This is the batch layer of our architecture
• It runs over the raw (and enriched) data, producing (further) enriched data sets
• Implemented using MapReduce technologies:
– Snowplow’s ETL is written in Scalding: https://github.com/twitter/scalding
– Scalding is a Scala wrapper around Cascading, a higher-level Java MapReduce library: http://www.cascading.org/
– The code looks like a mix of Scala and Cascading
– Metail’s ETL is written in Cascalog: http://cascalog.org
– Cascalog has been described as logic programming over Hadoop
– Cascading + Datalog = Cascalog
– Ridiculously compact and expressive – one of the steepest learning curves I’ve encountered in software engineering, but with no hidden traps
• Both run on AWS’s Elastic MapReduce (EMR): https://aws.amazon.com/elasticmapreduce/
– AWS has done the hard/tedious work of deploying Hadoop on EC2
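Cascalog itself is Clojure, but the shape of any batch-layer job is the same: a pure function recomputed from scratch over the full master record, producing a batch view. A toy sketch in plain Python (event fields are illustrative, not Snowplow's actual schema):

```python
from collections import Counter

# Toy stand-in for raw event records parsed from the CloudFront logs.
raw_events = [
    {"event": "pv", "user": "u1"},
    {"event": "pv", "user": "u2"},
    {"event": "outfit_saved", "user": "u1"},
    {"event": "pv", "user": "u1"},
]

def batch_view(events):
    """Recompute a view (page views per user) from scratch over all raw data.

    Because the view is always rebuilt from the immutable master record,
    a bug here is fixed by redeploying and re-running the job – there is
    no stored state to repair.
    """
    return Counter(e["user"] for e in events if e["event"] == "pv")

view = batch_view(raw_events)
```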
Extract Transform and Load (ETL)
• Snowplow’s ETL: https://github.com/snowplow/snowplow/wiki/setting-up-EmrEtlRunner
– The initial step is executed outside of EMR: data in the CloudFront incoming log bucket is copied to another S3 bucket for processing
– Next an EMR cluster is created
– Processing steps are then added to that cluster
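Driving this from code looks roughly as follows. A hedged sketch using boto3 (the bucket names, JAR names and arguments are hypothetical; the commented-out `run_job_flow` call is the real API but needs credentials and instance configuration):

```python
# Each EMR step is a Hadoop JAR plus its arguments; EMR runs the steps
# in order on the cluster. All names below are hypothetical.
def etl_steps(jar_bucket, data_bucket):
    return [
        {
            "Name": "snowplow-enrich",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": f"s3://{jar_bucket}/snowplow-enrich.jar",
                "Args": ["--input", f"s3://{data_bucket}/processing",
                         "--output", f"s3://{data_bucket}/enriched"],
            },
        },
        {
            "Name": "metail-enrich",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": f"s3://{jar_bucket}/metail-etl.jar",
                "Args": ["--input", f"s3://{data_bucket}/enriched",
                         "--output", f"s3://{data_bucket}/batch-views"],
            },
        },
    ]

steps = etl_steps("example-jars", "example-events")

# With AWS credentials configured, the cluster would be created like so:
# import boto3
# boto3.client("emr").run_job_flow(Name="etl", Instances={...}, Steps=steps)
```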
Extract Transform and Load (ETL)
• Metail’s ETL
– We run directly on the data in S3
– We store our JARs in S3 and have a process to deploy them
– We have several enrichment steps
– Our enrichment runs on Snowplow’s enriched events, enriching them further
– This is what builds our batch views for the serving layer
Extract Transform and Load (ETL)
• EMR and S3 get on very well
– AWS have engineered S3 so that it can behave as a Hadoop file system with very little loss of performance
– They recommend using S3 as the permanent data store
– An EMR cluster’s HDFS file system is, in my mind, a giant /tmp
– This encourages immutable infrastructure
– You don’t need your compute cluster running just to hold your data
– Snowplow and Metail output directly to S3
– The only reason Snowplow copies to local HDFS is that they’re aggregating the CloudFront logs – that’s transitory data
– You can archive S3 data to Glacier
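Archiving to Glacier is just a bucket lifecycle rule. A sketch (the bucket name, prefix and 365-day threshold are hypothetical choices; the commented-out call is boto3's real API for applying the rule):

```python
# S3 lifecycle rule transitioning old raw logs to Glacier. The master
# record is never deleted, only moved to cheaper storage.
lifecycle = {
    "Rules": [
        {
            "ID": "archive-raw-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            # Hypothetical threshold: after a year the batch layer
            # rarely needs to re-read the raw logs.
            "Transitions": [{"Days": 365, "StorageClass": "GLACIER"}],
        }
    ]
}

# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-event-archive", LifecycleConfiguration=lifecycle)
```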
Getting Insights
• The workhorse of Metail’s insights is Redshift: https://aws.amazon.com/redshift/
– I’d like it to be Cascalog, but even I’d hate that :P
• Redshift is a “petabyte-scale data warehouse”
– Offers a Postgres-like SQL dialect to query the data
– Uses a columnar, distributed data store
– It’s very quick
– We currently have a nine-node compute cluster (9 × 160 GB = 1.44 TB)
– Thinking of switching to dense storage nodes or re-architecting
– Growing at 10 GB a day
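Those last two figures explain the pressure to re-architect. Back-of-envelope, from the numbers above:

```python
# Nine nodes of 160 GB each, with the events table growing 10 GB a day.
nodes, per_node_gb, growth_gb_per_day = 9, 160, 10

capacity_gb = nodes * per_node_gb              # 1440 GB = 1.44 TB
days_to_full = capacity_gb / growth_gb_per_day

print(capacity_gb, days_to_full)  # 1440 144.0
```

So from empty, the cluster fills in under five months.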
Getting Insights
SELECT DATE_TRUNC('mon', collector_tstamp),
COUNT(event_id)
FROM events
GROUP BY DATE_TRUNC('mon', collector_tstamp)
ORDER BY DATE_TRUNC('mon', collector_tstamp);
Getting Insights
• The Snowplow pipeline is set up with Redshift as an endpoint: https://github.com/snowplow/snowplow/wiki/setting-up-redshift
• The Snowplow events table is loaded into Redshift directly from S3
• The events we enrich in EMR are also loaded into Redshift, again directly from S3
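Loading from S3 is done with Redshift's COPY command. A sketch building the statement in Python (the table name, bucket and IAM role are hypothetical; in production the string would be executed over a PostgreSQL connection to the cluster):

```python
# Build a Redshift COPY statement for tab-separated files in S3.
# All identifiers below are hypothetical.
def copy_statement(table, s3_path, iam_role):
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "DELIMITER '\\t' MAXERROR 1;"
    )

stmt = copy_statement("events_enriched",
                      "s3://example-events/batch-views/",
                      "arn:aws:iam::123456789012:role/redshift-load")
```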
Getting Insights
• The analysis of this data is done using a combination of…
• …a technology called Looker…
– This provides a powerful, Excel-like interface to the data
– While providing software engineering tools to manage the SQL used to explore the data
• …and R for the heavier stats
– We’re starting to interface directly with Redshift through a PostgreSQL driver
Managing the Pipeline
• I’ve almost certainly run out of time and not reached this slide :)
• Lemur to submit ad-hoc Cascalog jobs
– The initial, manual pipeline
– Clojure-based
• Snowplow have written their configuration tools in Ruby and bash
• We use AWS’s Data Pipeline: https://aws.amazon.com/datapipeline/
– More flaws than advantages
Metail at Cambridge AWS User Group Main Meetup #3