Metail and Elastic MapReduce


Metail lets users see clothes on their own body shape online, using only a minimal set of measurements. With your avatar you can create outfits and, coupled with our size advice, shop with confidence in the size and fit.

I'm part of the team within Metail that has built a pipeline to collect, enrich and serve data to the company and our clients, and which has been used to validate Metail's product. This talk was given at the AWS Loft in London on 21st April 2016. I gave an overview of the end-to-end pipeline and then went into detail on how we're using AWS's EMR to batch process the collected data, which is then served internally with Redshift.

Metail and Elastic MapReduce

  1. April 2016 – AWS Loft, London. Gareth Rogers, Data Engineer
  2. Metail lets you try on clothes online
     • Discover clothes on your body shape
     • Create, save and share outfits
     • Shop with confidence in size and fit
  3. Proven impact as validated by American business schools and A/B tests
     • “…customers who had access to the fitting tool are more likely to come back to the site, and this effect is statistically significant…”
     • “…shows approximately a 5.1 percent reduction in returns compared to the control group… In other words, providing fit information reduces average fulfilment costs”
     • “…sales for users with access to the tool were substantially higher overall - 22.32 percent larger”
     Source: “The Value of Fit Information in Online Retail: Evidence from a Randomized Field Experiment” by Prof Santiago Gallino (Dartmouth College - Tuck School of Business) & Prof Antonio Moreno (Northwestern University) – Oct 21, 2015
     Data: 3M data points, 1000+ garments
  4. Architecture Theory
     • Our architecture is modelled on Nathan Marz’s Lambda Architecture: http://lambda-architecture.net
     • Should include a speed layer to give a real-time view on sampled data
       – We’ve not implemented this
     Diagram: New Data → Batch Layer (master dataset) → Serving Layer (batch views) → Queries
  5. Architecture Practice – Data Collection
  6. Architecture Practice – Data Collection: New Data and Collection
     • We’re using Snowplow for the initial stages of our pipeline
     • Using their JavaScript tracker and CloudFront collector configuration
     • Tracker performs a GET request on a CloudFront-distributed image (pixel)
     • Query parameters of the request contain the event data, e.g. GET http://d2sgzneryst63x.cloudfront.net/i?e=pv&url=...&page=...&...
     • CloudFront configured to log the requests to S3
     • We now have our master record
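As a rough illustration of how event data rides on the pixel request, the sketch below pulls the query parameters out of a collector URL using Python's standard library. Only the host, `e=pv`, and the `url` and `page` parameter names come from the slide; the parameter values and the `parse_pixel_request` helper are made up for the example, and this is not Snowplow's own parsing code.

```python
from urllib.parse import urlsplit, parse_qs


def parse_pixel_request(request_url):
    """Return the event fields carried in a pixel GET request's query string."""
    query = urlsplit(request_url).query
    # parse_qs maps each key to a list of values; keep the first value per key.
    return {key: values[0] for key, values in parse_qs(query).items()}


# Hypothetical request: the host is the one on the slide, the values are invented.
example = (
    "http://d2sgzneryst63x.cloudfront.net/i"
    "?e=pv&url=http%3A%2F%2Fexample.com%2Fdresses&page=Dresses"
)
event = parse_pixel_request(example)
print(event["e"], event["url"], event["page"])
```

Because the tracker only ever issues GET requests for a static image, the collector itself needs no application server: the request log is the data.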
  7. Architecture Practice – Serving Layer
     • Initially queries over Hadoop, then Redshift came along
     • Redshift SQL good for small data science team!
     • Not so good for everyone else in the company
     • Introduced Looker
     • Data model in SQL
     • Dashboards
     • Point and click data exploration
     • Permissions
     • Version control
  8. Architecture Practice – Batch Layer
     • Process the raw events daily to create the batch view
     • Run using Elastic MapReduce (EMR), the hosted Hadoop service in AWS
     • Create views of the master record through enrichment and aggregation
     • Populates the schema for speedy Redshift queries
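A minimal sketch of the enrich-then-aggregate idea, in plain Python rather than Hadoop. The event fields (`timestamp`, `event`), the derived `date` column, the `click` event type and both helper functions are assumptions for illustration; the real jobs run on EMR and their schema isn't shown in the slides.

```python
from collections import Counter
from datetime import datetime, timezone


def enrich(event):
    """Add a derived `date` field (UTC day) to a raw event dict."""
    day = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc).date()
    return {**event, "date": day.isoformat()}


def build_batch_view(raw_events):
    """Aggregate enriched events into (date, event type, count) rows."""
    counts = Counter()
    for raw in raw_events:
        enriched = enrich(raw)
        counts[(enriched["date"], enriched["event"])] += 1
    # Sorted rows are easy to write out as a flat file for a warehouse load.
    return [(date, kind, n) for (date, kind), n in sorted(counts.items())]


# Invented sample of the master record (Unix timestamps in seconds).
raw_events = [
    {"timestamp": 1461225600, "event": "pv"},     # 2016-04-21 08:00 UTC
    {"timestamp": 1461229200, "event": "pv"},     # 2016-04-21 09:00 UTC
    {"timestamp": 1461312000, "event": "click"},  # 2016-04-22 08:00 UTC
]
batch_view = build_batch_view(raw_events)
```

The view is a pure function of the raw events, so it can be thrown away and rebuilt from the master record whenever the enrichment logic changes.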
  9. Extract Transform and Load (ETL)
     • Snowplow’s ETL driven by config files executed in Ruby
       – Initial step executed outside of EMR
       – Copy data from CloudFront incoming log bucket to another S3 bucket for processing
       – Next create EMR cluster
  10. Extract Transform and Load (ETL)
      • Snowplow’s ETL driven by config files executed in Ruby
        – Initial step executed outside of EMR
        – Copy data from CloudFront incoming log bucket to another S3 bucket for processing
        – Next create EMR cluster
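The "create EMR cluster" step might look roughly like the request below. It only builds the keyword arguments that boto3's EMR client accepts in `run_job_flow`; nothing is sent to AWS. Snowplow's runner is Ruby-based and its actual request isn't shown in the slides, so the cluster name, bucket name, release label, instance types and counts are all placeholders.

```python
def transient_cluster_request(name, log_bucket, steps):
    """Build keyword arguments for boto3's EMR `run_job_flow` call."""
    return {
        "Name": name,
        "LogUri": f"s3://{log_bucket}/emr-logs/",
        "ReleaseLabel": "emr-4.6.0",  # placeholder release
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "MasterInstanceType": "m3.xlarge",  # placeholder sizing
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 3,
            # Shut the cluster down after the last step; the data stays in S3.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": steps,
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# Hypothetical names; an empty step list keeps the example self-contained.
request = transient_cluster_request("daily-etl", "example-etl-bucket", steps=[])
```

With boto3 installed and credentials configured, this could be submitted as `boto3.client("emr").run_job_flow(**request)`.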
  11. Extract Transform and Load (ETL)
      • To that cluster we add steps
      • Initial step uses s3distcp to aggregate the log files
      • Snowplow’s ETL written in Scalding
        – Scalding = Cascading (Java higher level MapReduce libraries) in Scala
        – They provide a compiled JAR hosted in S3
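For a sense of what an s3distcp aggregation step involves, here is a sketch of a step definition in the shape EMR's API expects. `--src`, `--dest`, `--groupBy`, `--targetSize` and `--outputCodec` are S3DistCp options; the bucket paths, the regular expression, the sample file name and the sizes are invented, and Snowplow's actual step arguments may differ.

```python
import re

# Assumed pattern: CloudFront access logs are named
# <distribution-id>.<YYYY-MM-DD-HH>.<unique-id>.gz, so capture the date part.
GROUP_BY = r".*\.([0-9]{4}-[0-9]{2}-[0-9]{2})-[0-9]{2}\..*"


def s3distcp_step(src, dest):
    """Build an EMR step that merges small log files into larger ones per day."""
    return {
        "Name": "Aggregate CloudFront logs",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            # command-runner.jar is how EMR release 4.x and later invoke s3-dist-cp.
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", src,
                "--dest", dest,
                "--groupBy", GROUP_BY,
                "--targetSize", "128",  # target output file size in MiB
                "--outputCodec", "gz",
            ],
        },
    }


# Hypothetical source and destination paths.
step = s3distcp_step("s3://example-bucket/processing/", "hdfs:///local/raw-events/")

# Files whose names yield the same captured group are concatenated together.
sample_key = "E2ABCDEFGHIJKL.2016-04-21-08.a1b2c3d4.gz"
group = re.match(GROUP_BY, sample_key).group(1)
```

Merging many small log files before the main job matters because Hadoop handles a few large inputs far more efficiently than thousands of tiny ones.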
  12. Extract Transform and Load (ETL)
      • Metail’s ETL is very similar to Snowplow’s
      • Use AWS’ Data Pipeline to drive the workflow
        – Really great to get going
        – But quickly hit complexity limitations
  13. Extract Transform and Load (ETL)
      • Metail ETL written in
        – Cascalog, logic programming over Hadoop
        – Cascalog = Cascading + Datalog in Clojure
        – Ridiculously compact and expressive
        – But steep learning curve and impenetrable errors
  14. Extract Transform and Load (ETL)
      • Soon: Parkour, a Clojure wrapper over the Hadoop Java API
        – Access to the full Hadoop API with no abstractions, just more idiomatic Clojure
        – Learning curve is mainly Hadoop
        – Errors still impenetrable
  15. Summary
      • This pipeline has been built and managed by 3-5 people
      • It’s about a year and a half old and continues to evolve
      • Composed of a few different technologies, with EMR used to do the batch processing
      • Using EMR has made cluster management and scaling straightforward
      • The synergy between EMR and S3 is a powerful feature
        – Encourages immutable infrastructure
        – You don’t need your compute cluster running to hold your data!
