An overview of Metail's data pipeline architecture, with particular emphasis on our event collection and Hadoop-based batch processing layers. A brief overview of our business intelligence and data science work is also given.
Data Collection
• A significant part of our pipeline is powered by Snowplow:
http://snowplowanalytics.com
• We use their technology for event tracking and for setting up collection
– They have specified a tracking protocol, with implementations in many languages
– We’re using the JavaScript tracker
– Implementation very similar to Google Analytics (GA):
http://www.google.co.uk/analytics/
– But you have all the raw data
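A minimal sketch of the GET request the tracker fires. The parameter names e, url and page appear in the example request later in this section; the helper and its values are purely illustrative, not the tracker's actual implementation:

```python
from urllib.parse import urlencode

# The CloudFront collector endpoint shown later in this deck
COLLECTOR = "http://d2sgzneryst63x.cloudfront.net/i"

def page_view_url(page_url: str, page_title: str) -> str:
    """Build the pixel GET request for a page-view event (e=pv),
    in the style of the Snowplow JavaScript tracker."""
    payload = {
        "e": "pv",           # event type: page view
        "url": page_url,     # the page's URL
        "page": page_title,  # the page's title
    }
    return COLLECTOR + "?" + urlencode(payload)

print(page_view_url("http://metail.com/", "Metail"))
```

The browser requests this URL as an image; the collector only needs to log the request, since the whole event rides in the query string.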
Data Collection
• Where does AWS come in?
– Snowplow Cloudfront Collector: https://github.com/snowplow/snowplow/wiki/Setting-up-the-Cloudfront-collector
– Snowplow’s tracking GIF, called i, is uploaded to an S3 bucket
– Cloudfront serves the content of the bucket
– To collect the events the tracker performs a GET request
– Query parameters of the GET request contain the payload
– E.g. GET http://d2sgzneryst63x.cloudfront.net/i?e=pv&url=...&page=...&...
– Cloudfront is configured for HTTP and HTTPS, allowing only GET and HEAD, with logging enabled
– The Cloudfront requests (the events) are logged to our S3 bucket
– In Lambda Architecture terms these Cloudfront logs are our master record: the raw data
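Since the payload lives in the query string of each logged request, recovering an event from the raw data is just log parsing. A sketch, assuming the classic tab-separated CloudFront access log layout (URI stem as the 8th field, query string as the 12th); the sample line is made up for demonstration:

```python
from urllib.parse import parse_qs

def event_from_log_line(line: str) -> dict:
    """Recover the tracker payload from one CloudFront access log line.
    Assumes the classic tab-separated layout: cs-uri-stem is field 8
    and cs-uri-query field 12 (1-based)."""
    fields = line.rstrip("\n").split("\t")
    uri_stem, query = fields[7], fields[11]
    if uri_stem != "/i" or query == "-":
        return {}  # not a tracking-pixel request
    # parse_qs URL-decodes values and returns lists; keep the first value
    return {k: v[0] for k, v in parse_qs(query).items()}

# A made-up log line with just enough fields to demonstrate the parsing
sample = "\t".join([
    "2015-06-01", "12:00:00", "LHR3", "512", "1.2.3.4", "GET",
    "d2sgzneryst63x.cloudfront.net", "/i", "200", "-", "Mozilla",
    "e=pv&url=http%3A%2F%2Fmetail.com%2F&page=Metail",
])
print(event_from_log_line(sample))
```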
Extract Transform and Load (ETL)
• This is the batch layer of our architecture
• Runs over the raw (and enriched) data producing (further) enriched data sets
• Implemented using MapReduce technologies:
– Snowplow ETL written in Scalding
– Scalding wraps Cascading (a higher-level Java MapReduce library) in Scala:
https://github.com/twitter/scalding + http://www.cascading.org/
– The code looks like Scala driving Cascading
– Metail ETL written in Cascalog: http://cascalog.org
– Cascalog has been described as logic programming over Hadoop
– Cascading + Datalog = Cascalog
– Ridiculously compact and expressive – one of the steepest learning curves I’ve encountered in software engineering, but no hidden traps
– AWS’s Elastic MapReduce (EMR) https://aws.amazon.com/elasticmapreduce/
– AWS has done the hard/tedious work of deploying Hadoop to EC2
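To make "MapReduce" concrete: the pattern all these libraries abstract is a map phase emitting key/value pairs and a reduce phase aggregating per key. A plain-Python sketch of the model (not Scalding or Cascalog code, just the underlying idea, counting events per type):

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit one (key, value) pair per record -- here, a count of 1
    keyed on the event type."""
    for event in records:
        yield (event["e"], 1)

def reduce_phase(pairs):
    """Reduce: aggregate values per key. On Hadoop the shuffle ensures
    each reducer sees all values for a given key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

events = [{"e": "pv"}, {"e": "pv"}, {"e": "tr"}]
print(reduce_phase(map_phase(events)))  # counts per event type
```

Scalding and Cascalog express this same shape as pipelines or logic rules, which is where the compactness comes from.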
Extract Transform and Load (ETL)
• Snowplow’s ETL https://github.com/snowplow/snowplow/wiki/setting-up-EmrEtlRunner
– The initial step is executed outside of EMR
– It copies data from the Cloudfront incoming log bucket to another S3 bucket for processing
– Next it creates an EMR cluster
– Steps are then added to that cluster
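A "step" on an EMR cluster is essentially a JAR plus its arguments. The structure below is the shape the EMR API expects for a custom JAR step (e.g. via boto3's add_job_flow_steps); the bucket names and arguments here are hypothetical, not Snowplow's actual configuration:

```python
# One EMR step: a JAR to run on the cluster plus its arguments.
# Bucket names and args are illustrative placeholders.
step = {
    "Name": "snowplow-etl",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "HadoopJarStep": {
        "Jar": "s3://our-jars-bucket/snowplow-etl.jar",
        "Args": ["--input", "s3://our-processing-bucket/",
                 "--output", "s3://our-enriched-bucket/"],
    },
}
print(step["HadoopJarStep"]["Jar"])
```

EMR runs the steps in order on the cluster, so a multi-stage ETL is just a list of these dictionaries.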
Extract Transform and Load (ETL)
• Metail’s ETL
– We run directly on the data in S3
– We store our JARs in S3 and have a process to deploy them
– We have several enrichment steps
– Our enrichment runs on Snowplow’s enriched events
– And further enriches those events
– This is what builds our batch views for the serving layer
Extract Transform and Load (ETL)
• EMR and S3 get on very well
– AWS have engineered S3 so that it can behave as a native HDFS file system with very little loss of performance
– They recommend using S3 as the permanent data store
– The EMR cluster’s HDFS file system is, in my mind, a giant /tmp
– Encourages immutable infrastructure
– You don’t need your compute cluster running to hold your data
– Snowplow and Metail output directly to S3
– The only reason Snowplow copies to local HDFS is that they’re aggregating the Cloudfront logs
– That’s transitory data
– You can archive S3 data to Glacier
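Archiving to Glacier is driven by an S3 lifecycle rule. A sketch of the configuration structure S3's lifecycle API expects; the rule ID, prefix and 365-day threshold are illustrative choices, not our actual settings:

```python
# S3 lifecycle configuration to move old raw logs to Glacier.
# Prefix and day count are illustrative placeholders.
lifecycle = {
    "Rules": [{
        "ID": "archive-raw-logs",
        "Prefix": "cloudfront-logs/",
        "Status": "Enabled",
        "Transitions": [{"Days": 365, "StorageClass": "GLACIER"}],
    }]
}
print(lifecycle["Rules"][0]["Transitions"])
```

Because the master record lives in S3 rather than on the cluster, this archiving happens independently of any compute.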
Getting Insights
• The workhorse of Metail’s insights is Redshift: https://aws.amazon.com/redshift/
– I’d like it to be Cascalog but even I’d hate that :P
• Redshift is a “petabyte-scale data warehouse”
– Offers a Postgres-like SQL dialect to query the data
– Uses a columnar distributed data store
– It’s very quick
– Currently we have a nine-node compute cluster (9*160GB = 1.44TB)
– Thinking of switching to dense storage nodes or re-architecting
– Growing at 10GB a day
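The back-of-envelope behind that re-architecting decision, using only the figures above (this ignores current usage, replication and compression, so it is an upper bound on headroom):

```python
# Cluster capacity vs. stated growth rate.
nodes, per_node_gb, growth_gb_per_day = 9, 160, 10
capacity_gb = nodes * per_node_gb          # 1440 GB = 1.44 TB
days = capacity_gb / growth_gb_per_day     # ~144 days if starting empty
print(capacity_gb, days)
```

At 10GB a day the whole cluster fills in under five months even from empty, which is why dense storage nodes are on the table.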
Getting Insights
SELECT DATE_TRUNC('mon', collector_tstamp),
COUNT(event_id)
FROM events
GROUP BY DATE_TRUNC('mon', collector_tstamp)
ORDER BY DATE_TRUNC('mon', collector_tstamp);
Getting Insights
• The Snowplow pipeline is set up to have Redshift as an endpoint:
https://github.com/snowplow/snowplow/wiki/setting-up-redshift
• The Snowplow events table is loaded into Redshift directly from S3
• The events we enrich in EMR are also loaded into Redshift, again directly from S3
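Loading from S3 into Redshift is done with the SQL COPY command. A sketch of building one; the table name, bucket path, IAM role and options shown are illustrative placeholders, not Snowplow's exact load configuration:

```python
def copy_statement(table: str, s3_path: str, creds: str) -> str:
    """Build a Redshift COPY command to load a table directly from S3.
    Options shown (tab delimiter, gzip) are common choices, not
    necessarily the ones our pipeline uses."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"CREDENTIALS '{creds}' "
        "DELIMITER '\\t' GZIP;"
    )

print(copy_statement(
    "atomic.events",
    "s3://our-enriched-bucket/events/",
    "aws_iam_role=arn:aws:iam::123456789012:role/redshift-load",
))
```

The key point is that there is no intermediate staging system: Redshift pulls straight from the same S3 buckets the batch layer writes to.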
Getting Insights
• The analysis of this data is done using a combination of:
• A technology called Looker …
– This provides a powerful Excel-like interface to the data
– While providing software engineering tools to manage the SQL used to explore the data
• … and R for the heavier stats
– Starting to interface directly to Redshift through a PostgreSQL driver
Managing the Pipeline
• I’ve almost certainly run out of time and not reached this slide
• Lemur to submit ad-hoc Cascalog jobs
– The initial manual pipeline
– Clojure-based
• Snowplow have written their configuration tools in Ruby and bash
• We use AWS’s Data Pipeline: https://aws.amazon.com/datapipeline/
– More flaws than advantages