Growing a Data Pipeline for Analytics

Published in: Data & Analytics

  1. Growing a Data Pipeline for Analytics. Roberto Vitillo, Staff Data Engineer @ Mozilla. 26th PyData London Meetup
  2. brew install apache-spark
  3. Don’t do it yourself!
  4. Input Output ETL Storage
  5. JSON JSON?
  6. JSON Parquet Spark, Hive, Pig …
  7. JSON Parquet Spark, Hive, Pig … ???
  8. “The easier it is to ask questions, the more questions will be asked”
  9. Modern SQL supports Map, Arrays & Structs
  10. JSON Parquet Spark, Hive, Pig … Presto, Re:dash
  11. TL;DR
      • Don’t build your own pipeline unless you really have to
      • Use schemas
      • Exploit columnar storage
      • Use SQL
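
The last three points of the summary can be sketched in a few lines of plain Python. This is only an illustration using the standard library, not the pipeline from the talk: a dict of lists stands in for a columnar format such as Parquet, and an in-memory SQLite table stands in for a SQL engine such as Presto. The schema, the field names and the sample records are all hypothetical.

```python
import json
import sqlite3

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"client_id": str, "os": str, "session_seconds": int}

# Hypothetical raw input: one JSON document per line.
RAW = """\
{"client_id": "a", "os": "Linux", "session_seconds": 120}
{"client_id": "b", "os": "Windows", "session_seconds": 300}
{"client_id": "c", "os": "Linux", "session_seconds": "oops"}
{"client_id": "d", "os": "Darwin", "session_seconds": 60}
"""


def conforms(record, schema):
    """Return True if the record has exactly the schema's fields, correctly typed."""
    return record.keys() == schema.keys() and all(
        isinstance(record[name], kind) for name, kind in schema.items()
    )


def to_columns(records, schema):
    """Pivot row-oriented records into one list per field (a columnar layout)."""
    return {name: [r[name] for r in records] for name in schema}


# Use schemas: reject malformed records at ingestion time.
records = [json.loads(line) for line in RAW.splitlines()]
valid = [r for r in records if conforms(r, SCHEMA)]

# Exploit columnar storage: a query on one field reads only that column.
columns = to_columns(valid, SCHEMA)
total_seconds = sum(columns["session_seconds"])

# Use SQL: load the same columns into a table and ask the question declaratively.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (client_id TEXT, os TEXT, session_seconds INTEGER)"
)
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?, ?)",
    zip(*(columns[name] for name in SCHEMA)),
)
by_os = dict(
    conn.execute("SELECT os, SUM(session_seconds) FROM sessions GROUP BY os")
)
```

The record with a string in `session_seconds` is dropped by the schema check, so it never reaches the columnar data or the SQL table. In a real deployment the validation, the columnar encoding and the query engine would each be handled by existing tools rather than hand-written code, in keeping with the first point of the summary.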
