LSUG talk - Building data processing apps in Scala, the Snowplow experience


I talked to the London Scala Users' Group about building Snowplow, an open source event analytics platform, on top of Scala and key libraries and frameworks including Scalding, Scalaz and Spray. The talk highlights some of the data processing tricks and techniques picked up along the way, particularly schema-first development, monadic ETL, datatable-based testing and data transformation maps. It also introduces some of the Scala libraries the Snowplow team has open sourced (such as scala-forex, referer-parser and scala-maxmind-geoip).

Published in: Technology
  1. 1. Building data processing apps in Scala: the Snowplow experience London Scala Users’ Group
  2. 2. Building data processing apps in Scala 1. Snowplow – what is it? 2. Snowplow and Scala 3. Deep dive into our Scala code 4. Modularization and non-Snowplow code you can use 5. Roadmap 6. Questions 7. Appendix: even more roadmap
  3. 3. Snowplow – what is it?
  4. 4. Today, Snowplow is primarily an open source web analytics platform [Diagram: Snowplow data pipeline: Website / webapp; Collect; Amazon S3; Transform and enrich; Amazon Redshift / PostgreSQL] • Your granular, event-level and customer-level data, in your own data warehouse • Connect any analytics tool to your data • Join your web analytics data with any other data set
  5. 5. Snowplow was born out of our frustration with traditional web analytics tools… • Limited set of reports that don’t answer business questions: traffic levels by source, conversion levels, bounce rates, pages / visit • Web analytics tools don’t understand the entities that matter to business: customers, intentions, behaviours, articles, videos, authors, subjects, services… vs pages, conversions, goals, clicks, transactions • Web analytics tools are siloed: hard to integrate with other data sets incl. digital (marketing spend, ad server data), customer data (CRM), financial data (cost of goods, customer lifetime value)
  6. 6. …and out of the opportunities to tame new “big data” technologies These tools make it possible to capture, transform, store and analyse all your granular, event-level data, so you can perform any analysis
  7. 7. Snowplow is composed of a set of loosely coupled subsystems, architected to be robust and scalable • 1. Trackers: generate event data (examples: JavaScript tracker; Python / Lua / No-JS / Arduino tracker) • 2. Collectors: receive data from trackers and log it to S3 (examples: Cloudfront collector; Clojure collector for Amazon EB) • 3. Enrich: clean and enrich raw data (built on Scalding / Cascading / Hadoop and powered by Amazon EMR) • 4. Storage: store data ready for analysis (examples: Amazon Redshift; PostgreSQL; Amazon S3) • 5. Analytics • Standardised data protocols (A to D) sit between the subsystems • Batch-based: normally run overnight; sometimes every 4-6 hours
  8. 8. Snowplow and Scala
  9. 9. Our initial skunkworks version of Snowplow had no Scala [Diagram: Snowplow data pipeline v1: Website / webapp; JavaScript event tracker; CloudFront-based pixel collector; HiveQL + Java UDF “ETL”; Amazon S3]
  10. 10. But our schema-first, loosely coupled approach made it possible to start swapping out existing components… [Diagram: Snowplow data pipeline v2: Website / webapp; JavaScript event tracker; CloudFront-based event collector or Clojure-based event collector; Amazon S3; Scalding-based enrichment or HiveQL + Java UDF “ETL”; Amazon Redshift / PostgreSQL]
  11. 11. What is Scalding? • Scalding is a Scala API over Cascading, the Java framework for building data processing pipelines on Hadoop [Diagram: Scalding, Cascalog, cascading.jruby and PyCascading sit on top of Cascading; Cascading, Hive, Pig and plain Java sit on top of Hadoop MapReduce, which runs over Hadoop DFS]
  12. 12. We chose Cascading because we liked their “plumbing” abstraction over vanilla MapReduce
  13. 13. Why did we choose Scalding instead of one of the other Cascading DSLs/APIs? • Lots of internal experience with Scala – could hit the ground running (only very basic awareness of Clojure when we started the project) • Scalding created and supported by Twitter, who use it throughout their organization – so we knew it was a safe long-term bet • More controversial opinion (although maybe not at a Scala UG): we believe that data pipelines should be as strongly typed as possible – all the other DSLs/APIs on top of Cascading encourage dynamic typing
  14. 14. Strongly typed data pipelines – why? • Catch errors as soon as possible – and report them in a strongly typed way too • Define the inputs and outputs of each of your data processing steps in an unambiguous way • Forces you to formally address the data types flowing through your system • Lets you write code like this:
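
The slide's own code sample did not survive in this transcript. As a stand-in, here is a minimal sketch of a strongly typed pipeline step using only the Scala standard library. `RawEvent`, `EnrichedEvent` and `enrich` are hypothetical names chosen for illustration, not Snowplow's actual types.

```scala
object TypedPipeline {
  // Hypothetical input type: one raw collector line, fields still untyped strings.
  final case class RawEvent(timestamp: String, pageUrl: String)

  // Hypothetical output type: every field has a concrete, checked type.
  final case class EnrichedEvent(timestampMillis: Long, host: String, path: String)

  // The signature states exactly what goes in and what comes out:
  // either a human-readable error or a fully typed event.
  def enrich(raw: RawEvent): Either[String, EnrichedEvent] =
    for {
      ts  <- parseTimestamp(raw.timestamp)
      uri <- parseUri(raw.pageUrl)
    } yield EnrichedEvent(ts, uri.getHost, uri.getPath)

  private def parseTimestamp(s: String): Either[String, Long] =
    try Right(s.toLong)
    catch { case _: NumberFormatException => Left(s"Invalid timestamp: $s") }

  private def parseUri(s: String): Either[String, java.net.URI] =
    try {
      val uri = new java.net.URI(s)
      if (uri.getHost == null) Left(s"URL has no host: $s") else Right(uri)
    } catch { case _: java.net.URISyntaxException => Left(s"Invalid URL: $s") }
}
```

Because the error case is part of the return type, a caller cannot use an `EnrichedEvent` without first deciding what to do about failures.
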
  15. 15. Deep dive into our Scala code
  16. 16. The secret sauce for data processing in Scala: the Scalaz Validation (1/3) • Our basic processing model for Snowplow looks like this: raw events go into the Scalding enrichment process, which emits either “Good” enriched events or “Bad” raw events + reasons why they are bad • This fits incredibly well onto the Validation applicative functor from the Scalaz project
  17. 17. The secret sauce for data processing in Scala: the Scalaz Validation (2/3) • We were able to express our data flow in terms of some relatively simple types:
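
The type definitions shown on this slide are missing from the transcript. The sketch below suggests what such types could look like, using the standard library's `Either` as a stand-in for Scalaz's Validation so that it runs without dependencies. All names (`Validated`, `BadRow`, `enrich`, `process`) and the three-field rule are assumptions made for illustration.

```scala
object DataFlowTypes {
  // Stand-in for a Scalaz Validation: a list of error messages on the
  // left, or a successful value on the right.
  type Validated[A] = Either[List[String], A]

  // Hypothetical event types for illustration.
  type RawEvent = String
  final case class EnrichedEvent(fields: List[String])

  // A "bad" row keeps the original raw event plus the reasons it failed.
  final case class BadRow(raw: RawEvent, errors: List[String])

  // Hypothetical enrichment: expects exactly three tab-separated fields.
  def enrich(raw: RawEvent): Validated[EnrichedEvent] = {
    val fields = raw.split("\t", -1).toList
    if (fields.length == 3) Right(EnrichedEvent(fields))
    else Left(List(s"Expected 3 fields, found ${fields.length}"))
  }

  // Split a batch into the two output streams of the processing model.
  def process(raws: List[RawEvent]): (List[BadRow], List[EnrichedEvent]) = {
    val results = raws.map(raw => raw -> enrich(raw))
    val bad  = results.collect { case (raw, Left(errs)) => BadRow(raw, errs) }
    val good = results.collect { case (_, Right(event)) => event }
    (bad, good)
  }
}
```
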
  18. 18. The secret sauce for data processing in Scala: the Scalaz Validation (3/3) • Scalaz Validation lets us do a variety of different validations and enrichments, and then collate the failures • This is really powerful!
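
To make "collate the failures" concrete, here is a minimal sketch that runs three independent checks and reports every failure rather than only the first. It again uses plain `Either` as a stand-in: Scalaz's Validation achieves this through its applicative instance, whereas the hand-written `match` below only imitates the effect. The field names and rules are hypothetical.

```scala
object AccumulatingValidation {
  type Validated[A] = Either[List[String], A]

  final case class Event(userId: String, timestampMillis: Long, country: String)

  def validateUserId(s: String): Validated[String] =
    if (s.nonEmpty) Right(s) else Left(List("userId is empty"))

  def validateTimestamp(s: String): Validated[Long] =
    try Right(s.toLong)
    catch { case _: NumberFormatException => Left(List(s"timestamp is not a number: $s")) }

  def validateCountry(s: String): Validated[String] =
    if (s.length == 2) Right(s.toUpperCase)
    else Left(List(s"country is not a 2-letter code: $s"))

  // Run every check, then either build the event or return ALL failures.
  // A for-comprehension would stop at the first Left, so we combine by hand.
  def validate(userId: String, ts: String, country: String): Validated[Event] =
    (validateUserId(userId), validateTimestamp(ts), validateCountry(country)) match {
      case (Right(u), Right(t), Right(c)) => Right(Event(u, t, c))
      case (u, t, c) =>
        Left(List(u, t, c).collect { case Left(errs) => errs }.flatten)
    }
}
```

A single bad event therefore produces one record listing all of its problems, which is what makes the "bad rows plus reasons" output useful for debugging.
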
  19. 19. On the testing side: we love Specs2 data tables… • They let us test a variety of inputs and expected outputs without making the mistake of just duplicating the data processing functionality in the test:
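
The data table from this slide is not in the transcript. The sketch below shows the underlying idea with a plain list of tuples instead of Specs2's DataTables syntax, so it runs without the library. Each row pairs an input with a literal expected output, so the test never re-implements the logic it checks. `refererHost` is a hypothetical function under test.

```scala
object DataTableSketch {
  // Hypothetical function under test: extract the host from a referer URL.
  def refererHost(url: String): Option[String] =
    try Option(new java.net.URI(url).getHost)
    catch { case _: java.net.URISyntaxException => None }

  // Each row: (description, input, expected output), written out literally.
  val table: List[(String, String, Option[String])] = List(
    ("plain http URL",    "http://www.google.com/search?q=snowplow", Some("www.google.com")),
    ("https with port",   "https://example.com:8443/path",           Some("example.com")),
    ("no scheme",         "not-a-url",                               None),
    ("illegal character", "http://exa mple.com",                     None)
  )

  // Returns the descriptions of any rows whose actual output differs.
  def failures: List[String] =
    table.collect { case (desc, in, expected) if refererHost(in) != expected => desc }
}
```
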
  20. 20. … and are starting to do more with ScalaCheck • ScalaCheck is a property-based testing framework, originally inspired by Haskell’s QuickCheck • We use it in a few places – including to generate unpredictable bad data and also to validate our new Thrift schema for raw Snowplow events:
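
The ScalaCheck example from this slide is also missing. As a rough illustration of property-based testing with generated bad data, here is a hand-rolled sketch that uses `scala.util.Random` in place of ScalaCheck's generators (and without its shrinking). `parsePair`, the alphabet and the property itself are assumptions for illustration.

```scala
import scala.util.Random

object PropertySketch {
  // Hypothetical function under test: parse a tab-separated pair of ints.
  def parsePair(line: String): Either[String, (Int, Int)] =
    line.split("\t", -1) match {
      case Array(a, b) =>
        try Right((a.toInt, b.toInt))
        catch { case _: NumberFormatException => Left(s"Non-numeric field in: $line") }
      case fields => Left(s"Expected 2 fields, found ${fields.length}")
    }

  // Generate a short, mostly invalid line from digits, letters, tabs and dashes.
  def genLine(rng: Random): String = {
    val alphabet = "0123456789abc\t-"
    List.fill(rng.nextInt(12))(alphabet.charAt(rng.nextInt(alphabet.length))).mkString
  }

  // Property: a successful parse must round-trip to the same pair, and a
  // failure must carry a message. An uncaught exception aborts the check.
  def holdsFor(line: String): Boolean =
    parsePair(line) match {
      case Right((a, b)) => parsePair(s"$a\t$b") == Right((a, b))
      case Left(msg)     => msg.nonEmpty
    }

  // Seeded so that a failing run can be reproduced.
  def check(runs: Int, seed: Long): Boolean = {
    val rng = new Random(seed)
    List.fill(runs)(genLine(rng)).forall(holdsFor)
  }
}
```
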
  21. 21. Build and deployment: we have learnt to love (or at least peacefully co-exist with) SBT • .scala-based SBT build, not .sbt • We use sbt assembly to create a fat jar for our Scalding ETL process – with some custom exclusions to play nicely on Amazon Elastic MapReduce • Deployment is incredibly easy compared to the pain we have had with our two Ruby instrumentation apps (EmrEtlRunner and StorageLoader)
  22. 22. Modularization and non-Snowplow code you can use
  23. 23. We try to make our validation and enrichment process as modular as possible • This encourages testability and re-use – it also widens the pool of potential contributors compared with this functionality being embedded in Snowplow • The Enrichment Manager uses external libraries (hosted in a Snowplow repository) which can be used in non-Snowplow projects [Diagram labels: Enrichment Manager; Not yet integrated]
  24. 24. We also have a few standalone Scala projects which might be of interest • None of these projects assume that you are running Snowplow:
  25. 25. Snowplow roadmap
  26. 26. We want to move Snowplow to a unified log-based architecture [Diagram labels: CLOUD VENDOR / OWN DATA CENTER; NARROW DATA SILOES (Search, CMS Silo, E-comm Silo, ERP Silo, CRM Silo); SOME LOW LATENCY LOCAL LOOPS; SAAS VENDOR #1; SAAS VENDOR #2; APIs; Streaming APIs / web hooks; Eventstream; Unified log; Archiving; Hadoop; HIGH LATENCY; LOW LATENCY; WIDE DATA COVERAGE; FULL DATA HISTORY; Email marketing; Ad hoc analytics; Product rec’s; Systems monitoring; Management reporting; Fraud detection; Churn prevention]
  27. 27. Again, our schema-first approach is letting us get to this architecture through a set of baby steps (1/2) • In 0.8.12 at the start of the year we performed some surgery to de-couple our core enrichment code from its Scalding harness [Diagram: pre-0.8.12: hadoop-etl; 0.8.12: scala-hadoop-enrich and scala-kinesis-enrich, with the record-level enrichment functionality in scala-common-enrich]
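
The de-coupling described here can be pictured as one pure, record-level function shared by two thin harnesses. The sketch below is an illustration under that assumption only. The function names and the trivial "enrichment" are hypothetical and do not reflect the actual scala-common-enrich API.

```scala
object DecoupledEnrichment {
  // Shared core: a pure, record-level function with no knowledge of how
  // records arrive or where results go.
  def enrichRecord(raw: String): Either[String, String] =
    if (raw.trim.isEmpty) Left("Empty record")
    else Right(raw.trim.toLowerCase)

  // Hypothetical batch harness: processes a whole collection in one go.
  def runBatch(input: List[String]): List[Either[String, String]] =
    input.map(enrichRecord)

  // Hypothetical streaming harness: handles records one at a time,
  // pushing each result to a caller-supplied sink.
  def runStream(input: Iterator[String], sink: Either[String, String] => Unit): Unit =
    input.foreach(record => sink(enrichRecord(record)))
}
```

Keeping the core free of Hadoop or Kinesis types means it can be unit tested directly and reused by any new harness.
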
  28. 28. Then in 0.9.0 we released our first new Scala components leveraging Amazon Kinesis • The parts in grey are still under development – we are working with Snowplow community members on these collaboratively [Diagram labels: Snowplow Trackers; Scala Stream Collector; Raw event stream; S3 sink Kinesis app; S3; Enrich Kinesis app; Enriched event stream; Bad raw events stream; Redshift sink Kinesis app; Redshift]
  29. 29. Questions? @snowplowdata To have a coffee or beer and talk Scala/data – @alexcrdean or
  30. 30. Appendix: even more roadmap!
  31. 31. Separately, we want to re-architect our data processing pipeline to make it even more schema’ed! (1/3) • Our current approach involves a “Tracker Protocol” which is defined in a wiki page, processed in the Enrichment Manager and then written out to TSV files for loading into Redshift and Postgres (see over)
  32. 32. Separately, we want to re-architect our data processing pipeline to make it even more schema’ed! (3/3) • We are planning to replace the existing flow with a JSON Schema-driven approach, in which the JSON Schema defining events is used to: 1. Define the structure of raw events in JSON format 2. Validate events in the Enrichment Manager 3. Define the structure of enriched events in Thrift or Avro format 4. Drive shredding in the Shredder 5. Define the structure of enriched events in TSV ready for loading into db