
Data pipelines from zero


The presentation aims to demystify the practice of building reliable data processing pipelines. It gives a brief overview of the pieces needed to build a stable processing platform: data ingestion, processing engines, workflow management, and schemas. For each component, suitable technologies are suggested, along with best practices and pitfalls to avoid, most of them learnt through expensive mistakes.

Published in: Software


  1. Data pipelines from zero. Lars Albertsson, data architect @ Schibsted
  2. Who’s talking?
     - Swedish Institute of Computer Science (test tools)
     - Sun Microsystems (very large machines)
     - Google (Hangouts, productivity)
     - Recorded Future (NLP startup)
     - Cinnober Financial Tech. (trading systems)
     - Spotify (data processing & modelling)
     - Schibsted (data processing & modelling)
  3. Presentation goals
     - Overview of data pipelines for analytics / data products
     - Target audience: big data starters
     - Overview of necessary components; a base recipe in the vicinity of state of practice
     - A baseline for comparing design proposals
     - Subjective best practices; technology suggestions (with alternatives)
  4. Data product anatomy (diagram): ingress from services and databases into a unified log, ETL over cluster storage, and egress to serving databases, export, and business intelligence; pipelines are composed of jobs that produce datasets.
  5. Event collection
     - Services are unreliable; hand off events immediately to a reliable, write-available, append-only replicated log: Kafka (alternatives: Kinesis, Google Pub/Sub)
     - The unified log holds immutable, append-only events: the source of truth
     - Archive to cluster storage, HDFS (alternatives: NFS, S3, Google Cloud Storage, Cassandra), with tools such as Secor or Camus
     - Don’t manipulate, shuffle, sort, or demux on the way in. Add timestamps.
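The handoff discipline on this slide can be illustrated with a toy in-memory sketch (not the author's Kafka setup; class and field names are hypothetical): events are appended as-is with a timestamp, never mutated or reordered, and consumers read from an offset as in Kafka.

```python
import time

class UnifiedLog:
    """Toy append-only event log: immutable events, append-only,
    timestamps added at handoff, the log as source of truth."""

    def __init__(self):
        self._events = []  # never mutated, shuffled, sorted, or demuxed

    def append(self, event):
        # Hand off immediately; only add a collection timestamp.
        record = dict(event, collected_at=time.time())
        self._events.append(record)
        return len(self._events) - 1  # offset, as in Kafka

    def read(self, offset=0):
        # Consumers read from an offset; the log itself is unchanged.
        return self._events[offset:]

log = UnifiedLog()
log.append({"type": "pageview", "user": "u1"})
log.append({"type": "click", "user": "u2"})
assert [e["type"] for e in log.read()] == ["pageview", "click"]
```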
  6. Database state collection
     - Do: read snapshots / backups into cluster storage; use event conversion tools (Aegisthus, Bottled Water)
     - Careful: dumping a replicated slave
     - Don’t: use the service API, or dump the live master
  7. Datasets
     hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4/_SUCCESS part-00000.json part-00001.json
     - Hadoop + Hive naming conventions: privacy level (red), dataset class (pageviews), schema version (v1), instance parameters as Hive-style partitions (country=se/year=2015/...)
     - Instance = class + parameters, same schema; datasets are immutable
     - A _SUCCESS marker seals the dataset
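The naming convention above is mechanical, so a small helper can construct instance paths (a sketch of the slide's convention; the function name is hypothetical):

```python
def dataset_path(privacy, cls, version, **partitions):
    """Build a dataset instance path in the Hadoop/Hive-style layout
    from the slide: privacy level, dataset class, schema version,
    then instance parameters as key=value partitions."""
    parts = "/".join(f"{k}={v}" for k, v in partitions.items())
    return f"hdfs://{privacy}/{cls}/{version}/{parts}"

path = dataset_path("red", "pageviews", "v1",
                    country="se", year=2015, month=11, day=4)
# Matches the slide's example; a _SUCCESS marker inside seals it:
assert path == "hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4"
```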
  8. Pipelines
     - A pipeline is a dataset “build system”: input will be missing, jobs will fail, jobs will have bugs
     - Dataset = function([inputs], code); keep jobs deterministic and idempotent
     - Pristine, immutable datasets arrive from the unified log into cluster storage; intermediate and derived datasets are regenerable
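The "dataset = function([inputs], code)" idea can be made concrete with a hypothetical job written as a pure function: same inputs and same code always yield the same derived dataset, so reruns are safe.

```python
def pageview_counts(pageviews):
    """Derived dataset as a pure function of its input dataset:
    deterministic and idempotent, so a rebuild or backfill simply
    regenerates the same result (hypothetical job, not from the slides)."""
    counts = {}
    for event in pageviews:
        counts[event["country"]] = counts.get(event["country"], 0) + 1
    return sorted(counts.items())  # sort for a deterministic output

raw = [{"country": "se"}, {"country": "no"}, {"country": "se"}]
assert pageview_counts(raw) == pageview_counts(raw)  # rerun-safe
assert pageview_counts(raw) == [("no", 1), ("se", 2)]
```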
  9. Workflow manager: Luigi (alternatives: Airflow, Oozie)
     - The dataset “build tool”: build when input is available, backfill for previous failures, rebuild for bugs => eventual correctness
     - A DSL describes the DAG, including egress to databases
     - Also the place for data retention and privacy audits
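The build-tool semantics Luigi provides — build only when input is available, skip what already exists, retry later for eventual correctness — can be sketched in a few lines of plain Python (task and file names are hypothetical; this is not Luigi's API):

```python
import os, tempfile

workdir = tempfile.mkdtemp()

class Task:
    requires = []
    def output(self): raise NotImplementedError
    def run(self): raise NotImplementedError

class RawEvents(Task):
    # Pristine dataset: produced by ingestion, not by this pipeline.
    def output(self): return os.path.join(workdir, "raw.txt")
    def run(self): pass

class Report(Task):
    def __init__(self): self.requires = [RawEvents()]
    def output(self): return os.path.join(workdir, "report.txt")
    def run(self):
        with open(self.output(), "w") as f:
            f.write("report")

def build(task):
    """Walk the DAG; build when input is available."""
    for dep in task.requires:
        build(dep)
    if os.path.exists(task.output()):
        return "done"      # already built: rerunning is a no-op
    if not all(os.path.exists(d.output()) for d in task.requires):
        return "waiting"   # input missing: retry later (backfill)
    task.run()
    return "built"

assert build(Report()) == "waiting"           # input not ingested yet
with open(RawEvents().output(), "w") as f:    # ingestion delivers raw data
    f.write("events")
assert build(Report()) == "built"
assert build(Report()) == "done"              # eventual correctness
```

Repeated scheduler runs of `build` converge: failures and missing inputs leave "waiting" tasks that succeed on a later pass.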
  10. Batch processing MVP
     - Start simple, lean, and end-to-end, without Hadoop/Spark: serial jobs on a pool of machines plus a work queue
     - Downsample to fit one machine if necessary (local Spark, Scalding, Crunch, Akka reactive streams)
     - Get end-to-end workflows into production for trial; integration test end-to-end semantics
     - Ensure developer productivity: a fast code/test cycle
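The "serial jobs on a pool of machines plus a work queue" MVP can be sketched on one machine with threads standing in for machines (job payloads and names are hypothetical):

```python
import queue, threading

jobs = queue.Queue()   # the shared work queue
results = []
lock = threading.Lock()

def worker():
    """One 'machine': pulls jobs and runs them serially."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        out = job()  # each job is a plain serial function
        with lock:
            results.append(out)
        jobs.task_done()

for day in range(1, 4):
    jobs.put(lambda d=day: f"pageviews/day={d} processed")

pool = [threading.Thread(target=worker) for _ in range(2)]
for t in pool: t.start()
for t in pool: t.join()
assert sorted(results) == [f"pageviews/day={d} processed" for d in (1, 2, 3)]
```

The point of the slide is that this is enough to get end-to-end workflows into production and to keep the code/test cycle fast, before any cluster framework is introduced.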
  11. Processing at scale
     - Parallelise jobs only when forced to; most jobs fit on a single machine, a big complexity and performance win
     - Spark (alternatives: Hadoop + Scalding / Crunch); avoid vanilla MapReduce and non-JVM frameworks
  12. Schemas
     - Storage formats: JSON, Avro, Parquet, Protobuf, Thrift
     - There is always a schema, implicit or explicit
     - Schema on read: dynamic typing, quick schema changes. Schema on write: static typing possible.
     - Use schema on read for analytics. Incompatible change? Create a new dataset class.
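Schema on read can be illustrated with plain JSON events (field names hypothetical): old and new events coexist in storage unchanged, and the reader imposes structure and defaults at query time.

```python
import json

# Raw events keep whatever fields they were written with; the schema
# lives in the reader, so adding a field never rewrites old data.
raw = [
    '{"user": "u1", "country": "se"}',
    '{"user": "u2"}',                                     # older event
    '{"user": "u3", "country": "no", "referrer": "ad"}',  # newer field
]

def read_views(lines):
    for line in lines:
        event = json.loads(line)
        yield {"user": event["user"],
               "country": event.get("country", "unknown")}  # default on read

views = list(read_views(raw))
assert views[1]["country"] == "unknown"
assert views[2]["country"] == "no"
```

An incompatible change (say, renaming `user`) would break every reader, which is why the slide recommends a new dataset class instead.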
  13. Egress datasets
     - Serving: Cassandra, denormalised
     - Export & analytics: SQL, workbenches such as Zeppelin (alternatives: Elasticsearch, proprietary OLAP)
  14. Parting words
     - Keep things simple: batch, few components, little state
     - Don’t drop incoming data
     - Focus on the developer code/test/debug cycle, end to end
     - Expect and tolerate human error
     - Stay in harmony with technical ecosystems; follow the tech leaders
     - Scale only when necessary
     - Plan early for privacy, retention, audit, and schema evolution
  15. Bonus slides
  16. Cloud or not?
     - Pro: operations, security, responsive scaling
     - Con: development workflows, privacy, vendor lock-in
  17. Data pipelines example (diagram): raw datasets (users, page views, sales) feed derived datasets (views with demographics, sales with demographics, sales reports, conversion analytics).
  18. Data pipelines team organisation
     - Form teams driven by business cases and need
     - Forward-oriented -> filters implicitly applied
     - Beware of duplication, tech chaos/autonomy, and privacy loss
  19. Conway’s law: “Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.” Better, then, to organise to match the desired design.