Functional architectural patterns
Lars Albertsson
1
Who’s talking?
Swedish Institute of Computer Science (test tools)
Sun Microsystems (very large machines)
Google (Hangouts, productivity)
Recorded Future (NLP startup)
Cinnober Financial Technology (trading systems)
Spotify (data processing & modelling)
Schibsted (data processing & modelling)
2
Why functional?
Verbs
... has made ... expanding ...
... flourishes ... merged ... has been unable to escape lingering ... built ...
... are ... placed ... say ... are ... to explode ...
... are considering ... to reopen ... to recall ...
3
Or object-oriented?
Nouns, pronouns
... bankruptcy ... government bailout ... automaker Chrysler ... comeback ... sales ... Jeep sport utility vehicles.
... Chrysler ... part ... Fiat Chrysler Automobiles, it ... concerns ... the safety ... Jeeps ...
... Jeeps ... gas tanks ... regulators ... safety advocates ... rear-end crash.
... regulators ... an investigation ... those Jeeps ... Fiat Chrysler’s agreement ... models.
4
Functional benefits? My version.
Matches a few problems
Data processing
Matches a few computer properties
Consistency through immutability
Deterministic - replay for resilience
5
Local vs distributed properties
Local
Hardware provides strong consistency
Faults -> death
Distributed
Eventual consistency
Faults must be survived
6
Architectural functional patterns
Personal anti-pattern experiences
Strive to look for
Immutability
Reexecution
7
MapReduce
Discovered pattern, not invention
Well known, enough said
Succeeded by Spark RDD paradigm
8
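The MapReduce pattern can be sketched on a single machine with plain functions. This is an illustrative word count, not Hadoop or Spark code; the names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are invented for this sketch.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Emit (word, 1) for every word in a line; touches no shared state."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group values by key, as a framework would between map and reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because both phases are pure functions of their input, a failed task can simply be executed again on the same input.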
Data flows
[Diagram: raw datasets (Users, Page views, Sales) flow into derived datasets (Sales reports, Views with demographics, Sales with demographics, Conversion analytics)]
Dataset artifacts, typically files with date parameter.
9
Anti-pattern - isolated batch jobs
Get data (more on that later)
Cron an ETL batch job (function)
Output solidifies. Mostly.
Steps in isolation - often different teams
What to do on ETL code changes?
10
[Diagram labels: Sales with demographics, Views with demographics]
Pattern: data pipeline
End-to-end sequences/DAG of jobs
Pipelines not only exist, but are treated end-to-end
Input is raw, original data
Separate raw data from generated
11
[Diagram labels: Users, Page views, Sales with demographics, Views with demographics, Conversion analytics]
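A minimal sketch of the data pipeline pattern, assuming datasets are small in-memory lists keyed by name and date. The sample data and the function names (`views_with_demographics`, `views_per_country`, `run_pipeline`) are hypothetical; a real pipeline would read and write dated files.

```python
from datetime import date

Rows = list[dict]

# Raw, original data: stored as collected and never rewritten (hypothetical sample).
RAW: dict[tuple[str, date], Rows] = {
    ("users", date(2015, 6, 1)): [
        {"user": "u1", "country": "SE"},
        {"user": "u2", "country": "NO"},
    ],
    ("page_views", date(2015, 6, 1)): [
        {"user": "u1", "page": "/home"},
        {"user": "u2", "page": "/home"},
        {"user": "u1", "page": "/buy"},
    ],
}


def views_with_demographics(page_views: Rows, users: Rows) -> Rows:
    """Derived dataset: each page view joined with the viewer's country."""
    country = {u["user"]: u["country"] for u in users}
    return [{**view, "country": country[view["user"]]} for view in page_views]


def views_per_country(views: Rows) -> dict[str, int]:
    """Derived dataset: number of views per country."""
    counts: dict[str, int] = {}
    for view in views:
        counts[view["country"]] = counts.get(view["country"], 0) + 1
    return counts


def run_pipeline(day: date) -> dict[str, int]:
    """Derive everything for `day` from raw inputs only, end to end."""
    enriched = views_with_demographics(RAW[("page_views", day)], RAW[("users", day)])
    return views_per_country(enriched)
```

Every derived dataset is a function of raw data only, so the whole chain can be regenerated at any time without touching the raw inputs.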
Lambda architecture, part 1
Save all collected data without preprocessing
But timestamp on generation, registration, and arrival
Rerun everything downstream on code change
Human fault tolerance
In conflict with privacy management?
12
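The points above can be sketched as follows, assuming raw events are kept as immutable records with all three timestamps. `RawEvent`, its field names, and the two job versions are hypothetical and only illustrate how a bug in derived data is repaired by rerunning over untouched raw data.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class RawEvent:
    payload: str
    generated_at: datetime   # clock of the device that produced the event
    registered_at: datetime  # when the collecting service first saw it
    arrived_at: datetime     # when it landed in storage


# Hypothetical raw log: saved without preprocessing, never modified.
RAW_LOG: tuple[RawEvent, ...] = (
    RawEvent("view", datetime(2015, 6, 1, 23, 59),
             datetime(2015, 6, 2, 0, 0), datetime(2015, 6, 2, 0, 5)),
    RawEvent("view", datetime(2015, 6, 2, 10, 0),
             datetime(2015, 6, 2, 10, 0), datetime(2015, 6, 2, 10, 1)),
)


def daily_counts_v1(events) -> dict[date, int]:
    """First version of the job: buckets by arrival time (a human mistake)."""
    return dict(Counter(e.arrived_at.date() for e in events))


def daily_counts_v2(events) -> dict[date, int]:
    """Fixed version: buckets by generation time."""
    return dict(Counter(e.generated_at.date() for e in events))
```

Since the raw log was saved untouched, deploying `daily_counts_v2` and rerunning it over `RAW_LOG` replaces the faulty output completely.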
Pipeline workflow orchestration
Ideally: Good old make + cluster + IDE + xUnit
Test end-to-end
Rebuild on upstream changes (but not all)
State of practice: Luigi, Pinball, Azkaban
They don’t take you all the way :-(
13
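The make-like behaviour wished for above can be sketched with the standard library. The dataset names are taken from the data flow example; `DEPENDENCIES`, `downstream_of` and `rebuild_plan` are hypothetical helpers and not part of Luigi, Pinball or Azkaban.

```python
from graphlib import TopologicalSorter

# Each derived dataset lists the datasets it is built from.
DEPENDENCIES: dict[str, set[str]] = {
    "views_with_demographics": {"users", "page_views"},
    "sales_with_demographics": {"users", "sales"},
    "conversion_analytics": {"views_with_demographics", "sales_with_demographics"},
}


def downstream_of(changed: str) -> set[str]:
    """All datasets that directly or transitively depend on `changed`."""
    affected: set[str] = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for target, deps in DEPENDENCIES.items():
            if current in deps and target not in affected:
                affected.add(target)
                frontier.append(target)
    return affected


def rebuild_plan(changed: str) -> list[str]:
    """Targets to rebuild, in an order that respects dependencies."""
    affected = downstream_of(changed)
    order = TopologicalSorter(DEPENDENCIES).static_order()
    return [target for target in order if target in affected]
```

A change to `page_views` rebuilds only the two datasets downstream of it; `sales_with_demographics` is left alone.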
Lambda architecture, part 2
Parallel batch and real-time pipelines
Batch more accurate, overrides
Real-time for window of recent data
14
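A small sketch of how the two pipelines might be combined at query time, assuming both produce counts per hour bucket and that bucket keys sort chronologically. `merged_view` and `batch_complete_through` are names invented for this example.

```python
def merged_view(
    batch: dict[str, int],
    realtime: dict[str, int],
    batch_complete_through: str,
) -> dict[str, int]:
    """Serve batch results where they exist; use real-time results only
    for buckets newer than the last bucket the batch pipeline completed.

    Keys are hour buckets such as "2015-06-01T11", which compare
    chronologically as plain strings.
    """
    recent = {k: v for k, v in realtime.items() if k > batch_complete_through}
    return {**recent, **batch}


# Hypothetical numbers: the real-time count for 11:00 is slightly off.
batch = {"2015-06-01T10": 100, "2015-06-01T11": 120}
realtime = {"2015-06-01T11": 117, "2015-06-01T12": 90, "2015-06-01T13": 40}
view = merged_view(batch, realtime, batch_complete_through="2015-06-01T11")
```

Here the approximate real-time value for 11:00 is overridden by the batch value, while 12:00 and 13:00 are served from the real-time pipeline until the batch catches up.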
Obtaining data
Log things. Conceptually stable, but collection is challenging at scale.
Have legacy code and master data in databases? Let us have a look.
15
Anti-pattern: direct dump
Database dimensioned for online traffic
Hadoop = herd of elephants
Load spike: height = #mapper nodes, area = #users
16
Direct dumps in the trenches
Company successful - #users increasing
More Sqoop mappers - higher DB load
Daily dump jobs went to 25h
Devops firewalled off Hadoop to recover
17
Anti-pattern: dump through API
SOA/microservice culture
DB protected by throttling
API not used to elephants
Query area is still large
Herd of elephants through gate - 1-2 weeks
18
Anti-pattern: slave dump
Protect live service by mirroring to a dump slave
No online service risk, good!
Why anti-pattern?
19
All dumps are non-deterministic
HDFS down? Dump later.
State is gone - dump not accurate
Slave replication down?
Dump not accurate
20
Anti-pattern: deterministic mirror
Replay commit log until full day/hour
Discovered through archaeology :-)
Not scalable, point of failure
Hourly dump took 45 minutes, increasing...
21
(Anti-)pattern: better dumping
Netflix Aegisthus
Snapshot Cassandra (fast, atomic, reliable)
Transfer SSTables to HDFS
Replicate compaction in MapReduce
Other DBs? Depends on atomic snapshot.
22
All dumps are anti-patterns?
Typical use: Join activity events with user info
Event time != dump time
Aggregation discards information
Which users enabled X, tried, and disabled?
23
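The mismatch between event time and dump time can be illustrated with a toy join. All data here is hypothetical: one user changes country at noon, makes one purchase before and one after, and a nightly dump captures only the final state.

```python
from datetime import datetime

# Profile change history, sorted by time: (when, user, country).
profile_changes = [
    (datetime(2015, 6, 1, 0, 0), "u1", "SE"),
    (datetime(2015, 6, 1, 12, 0), "u1", "NO"),
]

# Activity events: (when, user).
purchases = [
    (datetime(2015, 6, 1, 9, 0), "u1"),
    (datetime(2015, 6, 1, 15, 0), "u1"),
]


def country_at(user: str, when: datetime):
    """Latest known country for `user` at `when`, from the change history."""
    known = [c for (t, u, c) in profile_changes if u == user and t <= when]
    return known[-1] if known else None


# A nightly dump only knows the state at dump time.
dump_time = datetime(2015, 6, 2, 2, 0)
dump = {"u1": country_at("u1", dump_time)}

joined_with_dump = [dump[user] for (_, user) in purchases]
joined_by_event_time = [country_at(user, when) for (when, user) in purchases]
```

Joining against the dump attributes both purchases to the country the user ended up in; joining by event time needs the change history, which the dump has discarded.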
Pattern: Event source
All facts are events. Immutable, timestamped
Event stream is source of truth
No explicit “current state”
The functional data architecture?
24
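A minimal sketch of event sourcing, assuming a feature called X and three hypothetical event kinds. "Current state" is not stored anywhere; it is computed by folding a pure function over the immutable event stream, and historical questions are answered from the same stream.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Event:
    timestamp: int  # hypothetical logical time
    user: str
    kind: str       # "enabled_x", "used_x" or "disabled_x"


EVENTS = (
    Event(1, "alice", "enabled_x"),
    Event(2, "alice", "used_x"),
    Event(3, "alice", "disabled_x"),
    Event(4, "bob", "enabled_x"),
    Event(5, "carol", "enabled_x"),
    Event(6, "carol", "disabled_x"),
)


def apply(state: frozenset, event: Event) -> frozenset:
    """Pure transition: the set of users that have X enabled."""
    if event.kind == "enabled_x":
        return state | {event.user}
    if event.kind == "disabled_x":
        return state - {event.user}
    return state


def current_state(events) -> frozenset:
    return reduce(apply, events, frozenset())


def contains_in_order(kinds, wanted) -> bool:
    """True if `wanted` occurs as a subsequence of `kinds`."""
    remaining = iter(kinds)
    return all(step in remaining for step in wanted)


def enabled_tried_disabled(events) -> set:
    kinds_by_user: dict[str, list[str]] = {}
    for event in events:
        kinds_by_user.setdefault(event.user, []).append(event.kind)
    wanted = ["enabled_x", "used_x", "disabled_x"]
    return {u for u, kinds in kinds_by_user.items() if contains_in_order(kinds, wanted)}
```

A snapshot of the derived state would show only that bob has X enabled. The event stream also reveals that alice enabled, tried, and disabled it, while carol disabled it without trying.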
Event source incarnated: unified log
Pour events into pub/sub bus, with long history.
Kafka de-facto standard.
Tap from bus to HDFS/S3 in time buckets.
Camus/Secor
Stream processing pipelines to dest topics
Replay on code changes
25
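Replay on code changes can be sketched with an in-memory stand-in for the bus. The `Log` class below is a toy written for this example and is not the Kafka client API; `run_stream_job` and the two `normalize` versions are likewise hypothetical.

```python
class Log:
    """Append-only in-memory log; stands in for a topic with retained history."""

    def __init__(self) -> None:
        self._records: list[str] = []

    def append(self, record: str) -> int:
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset: int = 0) -> list[str]:
        return self._records[offset:]


def run_stream_job(source: Log, transform) -> Log:
    """Process the whole retained history into a fresh destination topic."""
    dest = Log()
    for record in source.read_from(0):
        dest.append(transform(record))
    return dest


def normalize_v1(record: str) -> str:
    return record.lower()  # bug: forgets to strip whitespace


def normalize_v2(record: str) -> str:
    return record.strip().lower()


events = Log()
events.append(" Play")
events.append("PAUSE ")

dest_v1 = run_stream_job(events, normalize_v1)  # faulty output
dest_v2 = run_stream_job(events, normalize_v2)  # replayed after the fix
```

The source topic is never modified, so the corrected job can be replayed from offset 0 into a new destination topic for as long as the history is retained.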
Unified log, practical considerations
Long history necessary
Must have time to fix stream process bugs
Use 3+ months of history and use the stream as a temp DB
Unified log also useful for meta and control
Tweak Kafka for low latency
26
Event source + views
View = snapshot of aggregated state @ time
For ETL, choice of hourly/daily aggregates or exact views
27
[Diagram labels: Logs, View, View]
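A view as a snapshot of aggregated state at a given time can be sketched as below. The event tuples are hypothetical balance changes, assumed to be sorted by timestamp; `view_at` is a name invented for this example.

```python
from bisect import bisect_right

# (timestamp, user, delta), sorted by timestamp. Hypothetical data.
EVENTS = [
    (10, "u1", +5),
    (20, "u2", +3),
    (30, "u1", -2),
    (40, "u2", +4),
]


def view_at(events, t: int) -> dict[str, int]:
    """Aggregate state as of time t (inclusive), computed from the log."""
    timestamps = [ts for ts, _, _ in events]
    upto = bisect_right(timestamps, t)
    state: dict[str, int] = {}
    for _, user, delta in events[:upto]:
        state[user] = state.get(user, 0) + delta
    return state
```

An exact view is `view_at` for an arbitrary `t`; an hourly or daily aggregate is the same function evaluated only at bucket boundaries.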
Event source + database
Business logic may demand “current state”
Event stream is truth, keep DB in sync
28
Event source, synced database
A. Service interface generates events and DB transactions
B. Generate stream from DB commit log. Postgres, MySQL -> Kafka
C. Build DB with stream processing
29
Deployment & orchestration
System = many machines
Desired system state = code + config
Actual state = Orchestrator(current, desired)
30
Anti-pattern: stateful orchestration
Orchestrator = Puppet|Chef|Ansible {
current.changeSomeProperties(desired)
return current
// current.otherProperties unchanged
}
31
Stateful orchestration in the trenches
Desired = { case roleA: install(x,y)
case roleB: install(z) }
Current = x installed on roleB. Old x. Zombie woke up when B load decreased.
Puppet+apt = No simple way to remove undesired state
32
Pattern: artifacts from source
Orchestrator = Docker|Packer {
delete current
return Image(desired)
}
No state leak from existing state. Sort of.
33
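The difference between the two orchestration styles can be shown with a toy model in which a machine's state is a mapping from package name to version. This is only an illustration of the idea; it does not reflect how Puppet, Chef, Ansible, Docker or Packer are invoked, and the function names are hypothetical.

```python
def stateful_converge(current: dict[str, str], desired: dict[str, str]) -> dict[str, str]:
    """Stateful style: apply `desired` on top of whatever already exists.

    Anything not mentioned in `desired` is left as it was.
    """
    result = dict(current)
    result.update(desired)
    return result


def build_image(desired: dict[str, str]) -> dict[str, str]:
    """Artifact style: ignore the current state and build from `desired` only."""
    return dict(desired)


# Hypothetical machine in roleB that still carries an old package x.
current_role_b = {"x": "1.0"}
desired_role_b = {"z": "2.0"}

after_converge = stateful_converge(current_role_b, desired_role_b)
after_rebuild = build_image(desired_role_b)
```

The stateful result still contains the old `x`, which is the zombie from the trench story. The rebuilt image is a function of the desired state alone.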
Deterministic, predictable?
Image building leaky on purpose
E.g. “apt-get update && apt-get install”
Imports external state
Ephemeral databases preserve state
Ability to rebuild from unified log is valuable
34
Jay Kreps, Confluent: Unified log
Martin Kleppmann: Unified log, Bottled Water
Nathan Marz: Lambda
Sander Mak @ Jfokus: Event sourcing
Datomic
Questions?
More?
35
