Metail at Cambridge AWS User Group Main Meetup #3

An overview of Metail's data pipeline architecture, with particular emphasis on our event collection and batch processing layer, which is based on Hadoop. A brief overview of our business intelligence and data science work is also given.

  1. The fashion shopping future: Metail's Data Pipeline and AWS, October 2015
  2. Introduction
     • Introduction to Metail (from BD shiny)
     • Architecture Overview
     • Event Tracking and Collection
     • Extract Transform and Load (ETL)
     • Getting Insights
     • Managing the Pipeline
  3. The Metail Experience allows customers to…
     • Discover clothes on their body shape
     • Create, save and share outfits
     • Shop with confidence about size and fit
  4. Size & scale: 1.6m MeModels created
  5. Size & scale: +/- 88 countries
  6. Architecture Overview
     • Our architecture is modelled on Nathan Marz's Lambda Architecture: http://lambda-architecture.net
  7. Architecture Overview
     • Our architecture is modelled on Nathan Marz's Lambda Architecture: http://lambda-architecture.net
  8. New Data and Collection
  9. New Data and Collection / Batch Layer
  10. New Data and Collection / Batch Layer / Serving Layer
  11. Data Collection
      • A significant part of our pipeline is powered by Snowplow: http://snowplowanalytics.com
      • We use their technology for tracking and their setup for collection
        – They have specified a tracking protocol and implemented it in many languages
        – We're using the JavaScript tracker
        – The implementation is very similar to Google Analytics (GA): http://www.google.co.uk/analytics/
        – But you have all the raw data
  12. Data Collection
      • Where does AWS come in?
        – Snowplow Cloudfront Collector: https://github.com/snowplow/snowplow/wiki/Setting-up-the-Cloudfront-collector
        – We uploaded Snowplow's GIF, called i, to an S3 bucket
        – Cloudfront serves the content of the bucket
        – To record an event, the tracker performs a GET request for the GIF
        – The query parameters of the GET request carry the event payload
        – E.g. GET http://d2sgzneryst63x.cloudfront.net/i?e=pv&url=...&page=...&...
        – Cloudfront is configured for HTTP and HTTPS, allowing only GET and HEAD, with logging enabled
        – The Cloudfront requests (the events) are logged to our S3 bucket
        – In Lambda Architecture terms these Cloudfront logs are our master record: the raw data
      (a sketch of this request appears after the slide list)
  13. Extract Transform and Load (ETL)
      • This is the batch layer of our architecture
      • It runs over the raw (and enriched) data, producing (further) enriched data sets
      • Implemented using MapReduce technologies:
        – Snowplow's ETL is written in Scalding
        – Scalding is Cascading (a higher-level Java MapReduce library) in Scala: https://github.com/twitter/scalding + http://www.cascading.org/ – it looks like Scala and Cascading
        – Metail's ETL is written in Cascalog: http://cascalog.org
        – Cascalog has been described as logic programming over Hadoop: Cascading + Datalog = Cascalog
        – Ridiculously compact and expressive – one of the steepest learning curves I've encountered in software engineering, but no hidden traps
        – Both run on AWS's Elastic MapReduce (EMR): https://aws.amazon.com/elasticmapreduce/
        – AWS has done the hard/tedious work of deploying Hadoop to EC2
      (a Scalding-flavoured sketch of such a step appears after the slide list)
  14. Extract Transform and Load (ETL)
      • Snowplow's ETL: https://github.com/snowplow/snowplow/wiki/setting-up-EmrEtlRunner
        – The initial step is executed outside of EMR
        – Data in the incoming Cloudfront log bucket is copied to another S3 bucket for processing
        – Next, an EMR cluster is created
        – Steps are then added to that cluster
  15. Extract Transform and Load (ETL)
      • Metail's ETL
        – We run directly on the data in S3
        – We store our JARs in S3 and have a process to deploy them
        – We have several enrichment steps
        – Our enrichment runs on Snowplow's enriched events, enriching them further
        – This is what builds our batch views for the serving layer
      (a sketch of creating a cluster and submitting such a step appears after the slide list)
  16. Extract Transform and Load (ETL)
      • EMR and S3 get on very well
        – AWS has engineered things so that S3 can behave like a native HDFS file system with very little loss of performance
        – AWS recommends using S3 as the permanent data store
        – The EMR cluster's own HDFS is, in my mind, a giant /tmp
        – This encourages immutable infrastructure: you don't need your compute cluster running just to hold your data
        – Snowplow and Metail output directly to S3
        – The only reason Snowplow copies to local HDFS is to aggregate the Cloudfront logs, and that's transitory data
        – You can archive S3 data to Glacier
      (a file-system sketch appears after the slide list)
  17. Getting Insights
      • The workhorse of Metail's insights is Redshift: https://aws.amazon.com/redshift/
        – I'd like it to be Cascalog, but even I'd hate that :P
      • Redshift is a "petabyte-scale data warehouse"
        – Offers a Postgres-like SQL dialect to query the data
        – Uses a columnar, distributed data store
        – It's very quick
        – Currently we have a nine-node compute cluster (9 * 160 GB = 1.44 TB)
        – Thinking of switching to dense storage nodes or re-architecting
        – Growing at 10 GB a day
  18. Getting Insights
      SELECT DATE_TRUNC('mon', collector_tstamp),
             COUNT(event_id)
      FROM events
      GROUP BY DATE_TRUNC('mon', collector_tstamp)
      ORDER BY DATE_TRUNC('mon', collector_tstamp);
  19. Getting Insights
      • The Snowplow pipeline is set up with Redshift as an endpoint: https://github.com/snowplow/snowplow/wiki/setting-up-redshift
      • The Snowplow events table is loaded into Redshift directly from S3
      • The events we enrich in EMR are also loaded into Redshift, again directly from S3
  20. Getting Insights
      • The analysis of this data is done using a combination of…
      • A technology called Looker
        – This provides a powerful Excel-like interface to the data
        – While providing software-engineering tools to manage the SQL used to explore the data
      • … and R for the heavier stats
        – Starting to interface directly to Redshift through a PostgreSQL driver
      (a JDBC sketch of this connection appears after the slide list)
  21. Managing the Pipeline
      • I've almost certainly run out of time and not reached this slide
      • Lemur to submit ad-hoc Cascalog jobs
        – The initial, manual pipeline
        – Clojure based
      • Snowplow have written their configuration tools in Ruby and bash
      • We use AWS's Data Pipeline: https://aws.amazon.com/datapipeline/
        – More flaws than advantages
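
The sketches below expand on a few of the slides. They are illustrative only: all of them use Scala, even where the real code is Clojure, Ruby or JavaScript, and any bucket names, field layouts, hosts and credentials are invented.

Slide 12: a minimal sketch of the kind of GET request the JavaScript tracker fires at the Cloudfront collector. The collector domain, the /i pixel and the e, url and page parameters come from the slide; the example page values are made up.

  import java.net.URLEncoder
  import java.nio.charset.StandardCharsets.UTF_8

  object CollectorRequest {
    private def enc(s: String) = URLEncoder.encode(s, UTF_8.name)

    // Build the pixel URL whose query string carries the event payload.
    def pageViewUrl(pageUrl: String, pageTitle: String): String = {
      val params = Seq(
        "e"    -> "pv",      // event type: page view
        "url"  -> pageUrl,   // page the event was fired from
        "page" -> pageTitle  // human-readable page title
      ).map { case (k, v) => s"$k=${enc(v)}" }.mkString("&")
      s"http://d2sgzneryst63x.cloudfront.net/i?$params"
    }

    def main(args: Array[String]): Unit =
      // The tracker GETs this URL for a 1x1 GIF; Cloudfront logs the full query
      // string, and those logs become the raw master record in S3.
      println(pageViewUrl("https://metail.com/lookbook", "Lookbook"))
  }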
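
Slide 13: a Scalding-flavoured sketch of what a batch-layer step looks like, counting page views per day from tab-separated event lines. This is not Metail's Cascalog code; the column positions and the "page_view" label are assumptions.

  import com.twitter.scalding._

  class DailyPageViewsJob(args: Args) extends Job(args) {
    TypedPipe.from(TextLine(args("input")))
      .map(_.split("\t", -1))
      .collect { case fields if fields.length > 9 && fields(9) == "page_view" =>
        fields(0).take(10) // keep just the yyyy-mm-dd part of the timestamp
      }
      .groupBy(identity) // group the days...
      .size              // ...and count the events in each
      .write(TypedTsv[(String, Long)](args("output")))
  }

Such a job would typically be packaged as a JAR and submitted as an EMR step with --input and --output pointing at s3:// paths, which is the pattern slides 14 and 15 describe.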
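
Slides 14 and 15: a sketch of creating a transient EMR cluster and adding a custom-JAR step whose code, input and output all live in S3, using the AWS SDK for Java from Scala. Snowplow's real EmrEtlRunner is written in Ruby; the bucket names, JAR path, instance types and release label here are invented.

  import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder
  import com.amazonaws.services.elasticmapreduce.model._

  object RunEnrichment {
    def main(args: Array[String]): Unit = {
      val emr = AmazonElasticMapReduceClientBuilder.defaultClient()

      // A step that runs an enrichment JAR stored in S3 over data already in S3.
      val enrichStep = new StepConfig()
        .withName("metail-enrichment")
        .withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER)
        .withHadoopJarStep(new HadoopJarStepConfig()
          .withJar("s3://example-etl-jars/enrichment.jar")
          .withArgs("--input", "s3://example-events/enriched/",
                    "--output", "s3://example-events/batch-views/"))

      val request = new RunJobFlowRequest()
        .withName("nightly-etl")
        .withReleaseLabel("emr-4.2.0")
        .withApplications(new Application().withName("Hadoop"))
        .withServiceRole("EMR_DefaultRole")
        .withJobFlowRole("EMR_EC2_DefaultRole")
        .withLogUri("s3://example-emr-logs/")
        .withInstances(new JobFlowInstancesConfig()
          .withInstanceCount(5)
          .withMasterInstanceType("m3.xlarge")
          .withSlaveInstanceType("m3.xlarge")
          .withKeepJobFlowAliveWhenNoSteps(false)) // tear the cluster down after the steps
        .withSteps(enrichStep)

      println(s"Started job flow ${emr.runJobFlow(request).getJobFlowId}")
    }
  }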
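
Slide 16: on an EMR cluster, S3 buckets can be addressed through the ordinary Hadoop FileSystem API, which is why jobs read and write s3:// paths directly and the cluster's own HDFS can be treated as scratch space. The bucket and prefix below are invented.

  import java.net.URI
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  object ListBatchViews {
    def main(args: Array[String]): Unit = {
      // On EMR the s3:// scheme resolves to S3; off-cluster you would need the
      // s3a/s3n connector and credentials on the classpath.
      val fs = FileSystem.get(new URI("s3://example-events/"), new Configuration())
      fs.listStatus(new Path("s3://example-events/batch-views/"))
        .foreach(status => println(status.getPath))
    }
  }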
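
Slides 18-20: Redshift speaks a Postgres-compatible protocol, so it can be queried through the standard PostgreSQL JDBC driver, the same route R or any other JDBC/ODBC client can take. This runs essentially the query from slide 18; the cluster host, database, user and the REDSHIFT_PASSWORD environment variable are placeholders.

  import java.sql.DriverManager

  object MonthlyEventCounts {
    def main(args: Array[String]): Unit = {
      val url  = "jdbc:postgresql://example-cluster.redshift.amazonaws.com:5439/analytics"
      val conn = DriverManager.getConnection(url, "analyst", sys.env("REDSHIFT_PASSWORD"))
      try {
        val rs = conn.createStatement().executeQuery(
          """SELECT DATE_TRUNC('mon', collector_tstamp), COUNT(event_id)
            |FROM events
            |GROUP BY DATE_TRUNC('mon', collector_tstamp)
            |ORDER BY DATE_TRUNC('mon', collector_tstamp)""".stripMargin)
        while (rs.next()) println(s"${rs.getString(1)}\t${rs.getLong(2)}")
      } finally conn.close()
    }
  }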
