Metail at Cambridge AWS User Group Main Meetup #3


An overview of Metail's data pipeline architecture, with particular emphasis on our event collection and our Hadoop-based batch processing layer. A brief overview of our business intelligence and data science work is also given.


  1. The fashion shopping future: Metail's Data Pipeline and AWS (October 2015)
  2. Introduction
     • Introduction to Metail (from BD shiny)
     • Architecture Overview
     • Event Tracking and Collection
     • Extract Transform and Load (ETL)
     • Getting Insights
     • Managing the Pipeline
  3. The Metail Experience allows customers to…
     • Discover clothes on your body shape
     • Create, save and share outfits
     • Shop with confidence of size and fit
  4. Size & scale: 1.6m MeModels created
  5. Size & scale: 88 countries
  6. Architecture Overview
     • Our architecture is modelled on Nathan Marz's Lambda Architecture: http://lambda-architecture.net
  7. Architecture Overview
     • Our architecture is modelled on Nathan Marz's Lambda Architecture: http://lambda-architecture.net
  8. New Data and Collection
  9. New Data and Collection, Batch Layer
  10. New Data and Collection, Batch Layer, Serving Layer
  11. Data Collection
     • A significant part of our pipeline is powered by Snowplow: http://snowplowanalytics.com
     • We use their technology for tracking and their setup for collection
       – They have specified a tracking protocol and implemented it in many languages
       – We're using the JavaScript tracker
       – The implementation is very similar to Google Analytics (GA): http://www.google.co.uk/analytics/
       – But you have all the raw data :)
  12. Data Collection
     • Where does AWS come in?
       – Snowplow Cloudfront Collector: https://github.com/snowplow/snowplow/wiki/Setting-up-the-Cloudfront-collector
       – Snowplow's GIF, called i, is uploaded to an S3 bucket
       – Cloudfront serves the content of the bucket
       – To collect the events the tracker performs a GET request (sketched below)
       – The query parameters of the GET request contain the payload
       – E.g. GET http://d2sgzneryst63x.cloudfront.net/i?e=pv&url=...&page=...&...
       – Cloudfront is configured for HTTP and HTTPS, for GET and HEAD only, with logging enabled
       – Cloudfront requests, i.e. the events, are logged to our S3 bucket :)
       – In Lambda Architecture terms these Cloudfront logs are our master record and are the raw data
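     A minimal sketch of what such a collector request looks like, written in Scala rather than the JavaScript tracker we actually use: it URL-encodes a page-view payload into the query string of a GET for the i pixel. The collector domain, page title and URL below are placeholders, not our real values.

     import java.net.{HttpURLConnection, URL, URLEncoder}

     object PixelRequestSketch {
       // Placeholder Cloudfront distribution; substitute the real collector domain.
       val collector = "http://dxxxxxxxxxxxxx.cloudfront.net/i"

       def enc(s: String): String = URLEncoder.encode(s, "UTF-8")

       def main(args: Array[String]): Unit = {
         // A page-view event (e=pv) with its page title and URL, as in the example above.
         val payload = Map(
           "e"    -> "pv",
           "page" -> "Example page",
           "url"  -> "http://www.example.com/outfits"
         )
         val query = payload.map { case (k, v) => s"${enc(k)}=${enc(v)}" }.mkString("&")

         // Firing the GET is all it takes: Cloudfront serves the 1x1 GIF and logs the
         // request, query string included, to the S3 bucket configured for logging.
         val conn = new URL(s"$collector?$query").openConnection().asInstanceOf[HttpURLConnection]
         conn.setRequestMethod("GET")
         println(s"Collector responded ${conn.getResponseCode}")
         conn.disconnect()
       }
     }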
  13. Extract Transform and Load (ETL)
     • This is the batch layer of our architecture
     • Runs over the raw (and enriched) data, producing (further) enriched data sets
     • Implemented using MapReduce technologies:
       – Snowplow's ETL is written in Scalding (see the sketch after this slide): https://github.com/twitter/scalding + http://www.cascading.org/
         – Cascading (a higher-level Java MapReduce library) driven from Scala
         – Looks like Scala and Cascading
       – Metail's ETL is written in Cascalog: http://cascalog.org
         – Cascalog has been described as logic programming over Hadoop
         – Cascading + Datalog = Cascalog
         – Ridiculously compact and expressive – one of the steepest learning curves I've encountered in software engineering, but no hidden traps
       – We run on AWS's Elastic MapReduce (EMR): https://aws.amazon.com/elasticmapreduce/
         – AWS has done the hard/tedious work of deploying Hadoop to EC2
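     A minimal Scalding-flavoured sketch, just to give a feel for the fields API; this is illustrative only, not Snowplow's or Metail's actual ETL code. It counts events per day in tab-separated Cloudfront access logs, skipping the '#' header lines; the --input and --output paths are supplied on the command line and can be s3:// URIs when run on EMR.

     import com.twitter.scalding._

     // Illustrative only: count events per day in tab-separated Cloudfront access logs.
     // Cloudfront logs begin with '#' comment lines and carry the date in the first column.
     class EventsPerDayJob(args: Args) extends Job(args) {
       TextLine(args("input"))                                        // e.g. s3://my-bucket/raw-logs/
         .filter('line) { line: String => !line.startsWith("#") }     // drop the header lines
         .map('line -> 'date) { line: String => line.split("\t")(0) } // first field is the date
         .groupBy('date) { _.size('events) }                          // events per day
         .write(Tsv(args("output")))                                  // e.g. s3://my-bucket/batch-views/
     }

     On EMR this would be packaged as a JAR and submitted as a hadoop jar step via Scalding's Tool runner, with the --input/--output arguments pointing at S3.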
  14. Extract Transform and Load (ETL)
     • Snowplow's ETL: https://github.com/snowplow/snowplow/wiki/setting-up-EmrEtlRunner
       – The initial step is executed outside of EMR
       – Copy the data in the Cloudfront incoming log bucket to another S3 bucket for processing
       – Next, create an EMR cluster
       – To that cluster you add steps (see the sketch below)
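     The EmrEtlRunner itself is a Ruby tool; the sketch below just illustrates the "create a cluster, give it steps" shape using the AWS SDK for Java from Scala. The cluster name, release label, instance types, roles, JAR location and bucket paths are all placeholders.

     import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder
     import com.amazonaws.services.elasticmapreduce.model.{
       HadoopJarStepConfig, JobFlowInstancesConfig, RunJobFlowRequest, StepConfig
     }

     object CreateEtlClusterSketch {
       def main(args: Array[String]): Unit = {
         val emr = AmazonElasticMapReduceClientBuilder.defaultClient()

         // One ETL step: run a JAR held in S3 over data held in S3.
         val etlStep = new StepConfig()
           .withName("enrich-events")
           .withActionOnFailure("TERMINATE_CLUSTER")
           .withHadoopJarStep(new HadoopJarStepConfig()
             .withJar("s3://my-jars/etl-assembly.jar")
             .withArgs("--input", "s3://my-bucket/processing/",
                       "--output", "s3://my-bucket/enriched/"))

         // A transient cluster: it terminates once its steps have run.
         val request = new RunJobFlowRequest()
           .withName("etl-batch-run")
           .withReleaseLabel("emr-4.1.0")
           .withLogUri("s3://my-bucket/emr-logs/")
           .withServiceRole("EMR_DefaultRole")
           .withJobFlowRole("EMR_EC2_DefaultRole")
           .withInstances(new JobFlowInstancesConfig()
             .withInstanceCount(3)
             .withMasterInstanceType("m3.xlarge")
             .withSlaveInstanceType("m3.xlarge")
             .withKeepJobFlowAliveWhenNoSteps(false))
           .withSteps(etlStep)

         println("Started cluster " + emr.runJobFlow(request).getJobFlowId)
       }
     }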
  15. Extract Transform and Load (ETL)
     • Metail's ETL
       – We run directly on the data in S3
       – We store our JARs in S3 and have a process to deploy them (see the sketch below)
       – We have several enrichment steps
       – Our enrichment runs on Snowplow's enriched events
       – And further enriches those enriched events
       – This is what builds our batch views for the serving layer
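     To show the other half of that shape (adding steps to a cluster that is already running), here is a hedged sketch, again using the AWS SDK for Java from Scala. The cluster id, step names, JAR locations and S3 paths are placeholders, not our real deployment process.

     import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder
     import com.amazonaws.services.elasticmapreduce.model.{AddJobFlowStepsRequest, HadoopJarStepConfig, StepConfig}

     object AddEnrichmentStepsSketch {
       // Build one Hadoop JAR step from a name, a JAR in S3 and its arguments.
       def step(name: String, jar: String, stepArgs: String*): StepConfig =
         new StepConfig()
           .withName(name)
           .withActionOnFailure("CANCEL_AND_WAIT")
           .withHadoopJarStep(new HadoopJarStepConfig().withJar(jar).withArgs(stepArgs: _*))

       def main(args: Array[String]): Unit = {
         val emr = AmazonElasticMapReduceClientBuilder.defaultClient()

         // Queue a couple of enrichment steps, in order, on the running cluster.
         val request = new AddJobFlowStepsRequest()
           .withJobFlowId("j-XXXXXXXXXXXX")   // placeholder cluster id
           .withSteps(
             step("sessionise", "s3://my-jars/sessionise.jar",
                  "--input", "s3://my-bucket/enriched/", "--output", "s3://my-bucket/views/sessions/"),
             step("aggregate-garments", "s3://my-jars/aggregate-garments.jar",
                  "--input", "s3://my-bucket/views/sessions/", "--output", "s3://my-bucket/views/garments/"))

         println("Queued steps: " + emr.addJobFlowSteps(request).getStepIds)
       }
     }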
  16. Extract Transform and Load (ETL)
     • EMR and S3 get on very well
       – AWS have engineered S3 so that it can behave as a native HDFS file system with very little loss of performance
       – They recommend using S3 as the permanent data store
       – The EMR cluster's HDFS file system is, in my mind, a giant /tmp
       – This encourages immutable infrastructure
       – You don't need your compute cluster running to hold your data
       – Snowplow and Metail output directly to S3
       – The only reason Snowplow copies to local HDFS is because they're aggregating the Cloudfront logs, and that's transitory data
       – You can archive S3 data to Glacier
  17. Getting Insights
     • The workhorse of Metail's insights is Redshift: https://aws.amazon.com/redshift/
       – I'd like it to be Cascalog, but even I'd hate that :P
     • Redshift is a "petabyte-scale data warehouse"
       – Offers a Postgres-like SQL dialect to query the data
       – Uses a columnar, distributed data store
       – It's very quick
       – Currently we have a nine-node compute cluster (9 × 160 GB = 1.44 TB)
       – Thinking of switching to dense storage nodes or re-architecting
       – Growing at 10 GB a day
  18. Getting Insights
     SELECT DATE_TRUNC('mon', collector_tstamp),
            COUNT(event_id)
     FROM events
     GROUP BY DATE_TRUNC('mon', collector_tstamp)
     ORDER BY DATE_TRUNC('mon', collector_tstamp);
  19. Getting Insights
     • The Snowplow pipeline is set up to have Redshift as an endpoint: https://github.com/snowplow/snowplow/wiki/setting-up-redshift
     • The Snowplow events table is loaded into Redshift directly from S3
     • The events we enrich in EMR are also loaded into Redshift, again directly from S3 (see the sketch below)
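     A hedged sketch of what "directly from S3" means in practice: a Redshift COPY statement issued over the standard PostgreSQL JDBC driver (the same driver route mentioned on the next slide). The cluster endpoint, credentials, table and bucket below are placeholders; for the standard events table Snowplow's own StorageLoader handles this.

     import java.sql.DriverManager

     object LoadEnrichedEventsSketch {
       def main(args: Array[String]): Unit = {
         // Redshift speaks the Postgres wire protocol, so the stock JDBC driver works.
         // The endpoint, database, user and REDSHIFT_PASSWORD env var are placeholders.
         val conn = DriverManager.getConnection(
           "jdbc:postgresql://my-cluster.xxxxxxxx.eu-west-1.redshift.amazonaws.com:5439/analytics",
           "loader", sys.env("REDSHIFT_PASSWORD"))

         // COPY pulls the gzipped, tab-separated enriched events straight from S3 into
         // the target table; no intermediate host is involved.
         val copySql =
           """COPY derived.enriched_events
             |FROM 's3://my-bucket/enriched/run=2015-10-01/'
             |CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
             |DELIMITER '\t' GZIP MAXERROR 10""".stripMargin

         val stmt = conn.createStatement()
         try stmt.execute(copySql) finally { stmt.close(); conn.close() }
       }
     }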
  20. Getting Insights
     • The analysis of this data is done using a combination of…
     • … a technology called Looker …
       – This provides a powerful, Excel-like interface to the data
       – While providing software engineering tools to manage the SQL used to explore the data
     • … and R for the heavier stats
       – Starting to interface directly to Redshift through a PostgreSQL driver
  21. Managing the Pipeline
     • I've almost certainly run out of time and not reached this slide :)
     • Lemur to submit ad-hoc Cascalog jobs
       – The initial manual pipeline
       – Clojure based
     • Snowplow have written their configuration tools in Ruby and bash
     • We use AWS's Data Pipeline: https://aws.amazon.com/datapipeline/
       – More flaws than advantages
