Data pipelines from zero

Lars Albertsson
Data architect @ Schibsted
www.mapflat.com

Who’s talking?
Swedish Institute of Computer Science (test tools)
Sun Microsystems (very large machines)
Google (Hangouts, productivity)
Recorded Future (NLP startup)
Cinnober Financial Tech. (trading systems)
Spotify (data processing & modelling)
Schibsted (data processing & modelling)

Presentation goals
Overview of data pipelines for analytics / data products
Target audience: Big data starters
Overview of necessary components
Base recipe
In vicinity of state-of-practice
Baseline for comparing design proposals
Subjective best practices
Technology suggestions (alternatives in parentheses)

Data product anatomy
[Diagram: services and DBs; ingress; cluster storage with the unified log; ETL pipeline of jobs and datasets; egress to DBs, services, export and business intelligence]

Event collection
[Diagram: Service, unreliable links, then a reliable, write-available log: Kafka (Kinesis, Google Pub/Sub); Secor or Camus copy it into cluster storage: HDFS (NFS, S3, Google CS, C*)]
Unified log
Immutable events
Append-only
Source of truth
Immediate handoff to append-only replicated log.
Don’t manipulate, shuffle, sort, demux. Add timestamps.
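
The handoff rule above can be illustrated with a minimal sketch. It is not a Kafka client; a local file stands in for the replicated log, and the function name append_event and the field names received_at and payload are assumptions made for this example. The payload is stored exactly as received; only a timestamp is added.

```python
import json
import time


def append_event(log_path, payload, clock=time.time):
    """Append one event to an append-only log, adding only a receive timestamp.

    The payload is stored untouched: no parsing, sorting or demuxing.
    """
    record = {"received_at": clock(), "payload": payload}
    # Mode "a" only ever adds to the end of the file.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return record
```

Any interpretation of the payload is left to downstream jobs, which keeps the collection path small and hard to break.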

Database state collection
Do: Read snapshots, event conversion tools
(Aegisthus, Bottled Water)
Careful: Dump replicated slave
Don’t: Use API, dump live master
[Diagram: Service, DB, DB backup, Service; cluster storage: HDFS (NFS, S3, Google CS, C*)]

Datasets
hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4/_SUCCESS
part-00000.json
part-00001.json
Path anatomy: red = privacy level, pageviews = dataset class, v1 = schema version, country=se/year=2015/month=11/day=4 = instance parameters (Hive convention), _SUCCESS = seal, part-*.json = partitions
Hadoop + Hive name conventions
Instance = class + parameters, same schema
Immutable
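
As a sketch of "instance = class + parameters", the path on this slide can be assembled from its parts. The helper names instance_path and seal_path are hypothetical, and reading "red" as the privacy level is an assumption based on the slide labels.

```python
def instance_path(privacy_level, dataset_class, schema_version, parameters):
    """Build a dataset instance path using Hive-style key=value directories.

    parameters is an ordered list of (key, value) pairs.
    """
    partition_dirs = "/".join(f"{key}={value}" for key, value in parameters)
    return f"hdfs://{privacy_level}/{dataset_class}/v{schema_version}/{partition_dirs}"


def seal_path(instance):
    """Path of the marker file that declares the instance complete."""
    return f"{instance}/_SUCCESS"


pageviews_se = instance_path(
    "red", "pageviews", 1,
    [("country", "se"), ("year", 2015), ("month", 11), ("day", 4)],
)
```

Because the path is a pure function of class and parameters, any job can compute where its input should be without a catalogue lookup.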

Pipelines
Dataset “build system”
Input will be missing
Jobs will fail
Jobs will have bugs
Dataset =
function([inputs], code)
Deterministic, idempotent
[Diagram: cluster storage; the unified log holds pristine, immutable datasets; intermediate and derived datasets are regenerable]
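
One way to read "Dataset = function([inputs], code)" is sketched below, with local directories standing in for cluster storage. The word count job and the names word_counts and build_dataset are made up for illustration. The transformation is pure, the output is written under a temporary name and renamed, and the _SUCCESS seal is written last, so a rerun after a crash is safe.

```python
import json
import os
from pathlib import Path


def word_counts(lines):
    """Pure transformation: the same input lines always give the same counts."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def build_dataset(input_file, output_dir):
    """Build output_dir from input_file. Returns False if it is already sealed."""
    output_dir = Path(output_dir)
    seal = output_dir / "_SUCCESS"
    if seal.exists():
        return False  # idempotent: a sealed instance is never rebuilt in place
    output_dir.mkdir(parents=True, exist_ok=True)
    lines = Path(input_file).read_text(encoding="utf-8").splitlines()
    tmp = output_dir / "part-00000.json.tmp"
    # sort_keys makes the bytes on disk deterministic as well.
    tmp.write_text(json.dumps(word_counts(lines), sort_keys=True), encoding="utf-8")
    os.replace(tmp, output_dir / "part-00000.json")
    seal.touch()  # seal last, so readers never see a half-written instance
    return True
```

To rebuild after a bug fix, delete the derived instance (or write to a new version directory) and run the job again.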

Workflow manager
Luigi (Airflow, Oozie)
Dataset “build tool”
Build when input is available
Backfill for previous failures
Rebuild for bugs
=> Eventual correctness
DSL describes DAG
Includes egress
Data retention, privacy audit
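
The "build when input is available" behaviour can be sketched in a few lines of plain Python. This is not the Luigi API; plan_builds and the dataset names are hypothetical and only show the idea of a DAG where missing outputs are built as soon as their inputs exist.

```python
def plan_builds(jobs, existing):
    """Return the outputs that can be built now, in dependency order.

    jobs maps an output dataset name to the list of its input dataset names.
    existing is the set of dataset names that are already present and sealed.
    """
    available = set(existing)
    plan = []
    progress = True
    while progress:
        progress = False
        for output, inputs in jobs.items():
            if output not in available and all(i in available for i in inputs):
                plan.append(output)
                available.add(output)
                progress = True
    return plan


jobs = {
    "views_with_demographics/day=4": ["pageviews/day=4", "users/day=4"],
    "conversion/day=4": ["views_with_demographics/day=4", "sales/day=4"],
}
```

Running the planner on a schedule gives backfill for free: a dataset that could not be built yesterday because an input was late is picked up on the next run.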

Batch processing MVP
Start simple, lean, end-to-end, without Hadoop/Spark
Serial jobs on pool of machines + work queue
Downsample to fit one machine if necessary
(Local Spark, Scalding, Crunch, Akka reactive streams)
Get end-to-end workflows in production for trial
Integration test end-to-end semantics
Ensure developer productivity - code/test cycle
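
A possible way to downsample so that a job fits one machine is to keep a fixed fraction of keys, chosen by hash. This is a sketch under the assumption that records carry a key such as a user id; keep_in_sample is a hypothetical helper. Hashing, rather than random sampling, keeps the same users in every dataset, so joins between downsampled datasets still match.

```python
import hashlib


def keep_in_sample(key, percent):
    """Return True if key falls in the sampled fraction (0 to 100 percent).

    The decision depends only on the key, so it is stable across jobs and runs.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


sampled_users = [u for u in (f"user-{n}" for n in range(10000)) if keep_in_sample(u, 10)]
```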

Processing at scale
Parallelise jobs only when forced to do so
Spark, (Hadoop + Scalding / Crunch)
Avoid: Vanilla MapReduce, non-JVM
Most jobs fit on a single machine
Staying on one machine is a big complexity + performance win

Schemas
Storage formats: JSON, Avro, Parquet, Protobuf, Thrift
There is always a schema, implicit or explicit
Schema on read
Dynamic typing, quick schema changes
Schema on write
Static typing possible
Use schema on read for analytics.
Incompatible change? New dataset class.
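
Schema on read can be sketched as a tolerant reader: the stored JSON carries no enforced schema, and the job applies one when it loads a record. The PageView type and its fields are invented for this example. Fields added later get defaults and unknown fields are ignored, which is what makes quick, compatible schema changes possible.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PageView:
    user_id: str
    url: str
    referrer: Optional[str] = None  # hypothetical field added in a later release


def read_pageview(line):
    """Apply the reader's schema to one stored JSON line."""
    raw = json.loads(line)
    return PageView(
        user_id=str(raw["user_id"]),   # required: fail loudly if missing
        url=raw["url"],
        referrer=raw.get("referrer"),  # optional: older records lack it
    )
```

A change that this kind of reader cannot absorb, such as a field changing meaning, is the incompatible case that calls for a new dataset class.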

Egress datasets
Serving
Cassandra, denormalised
Export & Analytics
SQL
Workbenches (Zeppelin)
(Elasticsearch, proprietary OLAP)

Parting words
Keep things simple. Batch, few components & little state.
Don’t drop incoming data.
Focus on developer code, test, debug cycle - end to end.
Expect, tolerate human error.
Harmony with technical ecosystems - follow tech leaders.
Scalability only when necessary.
Plan early: Privacy, retention, audit, schema evolution.

Cloud or not?
+ Operations
+ Security
+ Responsive scaling
- Development workflows
- Privacy
- Vendor lock-in

Data pipelines example
[Diagram: raw datasets (Users, Page views, Sales) feed derived datasets (Views with demographics, Sales with demographics, Sales reports, Conversion analytics)]
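
One derived dataset from this example can be sketched as a plain function, assuming that "Views with demographics" is page views joined with users on a user id. The field names user_id, age and country are hypothetical; the slide does not specify them.

```python
def views_with_demographics(page_views, users):
    """Join each page view with the demographics of the user who made it."""
    users_by_id = {user["user_id"]: user for user in users}
    joined = []
    for view in page_views:
        user = users_by_id.get(view["user_id"], {})
        # Views from unknown users are kept, with empty demographics.
        joined.append({**view, "age": user.get("age"), "country": user.get("country")})
    return joined


users = [{"user_id": "u1", "age": 34, "country": "se"}]
page_views = [{"user_id": "u1", "url": "/home"}, {"user_id": "u2", "url": "/buy"}]
enriched = views_with_demographics(page_views, users)
```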

Data pipelines team organisation
Form teams that are driven by business cases & need
Forward-oriented -> filters implicitly applied
Beware of: duplication, tech chaos/autonomy, privacy loss

Conway’s law
“Organizations which design systems ... are
constrained to produce designs which are
copies of the communication structures of
these organizations.”
Better organise to match desired design, then.
