www.scling.com
Eventually, time will kill your data processing
Øredev, 2019-11-07
Lars Albertsson
Scling
Out of joint
“The time is out of joint. O cursed spite,
That ever I was born to set it right!”
- Hamlet, prince of Denmark
Out of joint
Free users using premium service
Time will kill us all
● Time handling causes data processing problems
● Observed issues
● Principles, patterns, anti-patterns
Goals:
● Awareness, recognition
● Tools from my toolbox
Big data - a collaboration paradigm
Stream storage
Data lake
Data democratised
Data pipelines
Data lake
More data - decreased friction
Data lake
Stream storage
Scling - data-value-as-a-service
Data lake
Stream storage
● Extract value from your data
● Data platform + custom data pipelines
● Imitate data leaders:
○ Quick idea-to-production
○ Operational efficiency
Our marketing strategy:
● Promiscuously share knowledge
○ On slides devoid of glossy polish
Data platform overview
(Diagram: online services feed the offline data platform: datasets in a data lake and cold store, batch processing jobs forming pipelines, coordinated by workflow orchestration)
Data categories, time angle
● Facts
○ Events, observations
○ Time stamped by clock(s)
○ Continuous stream
● State
○ Snapshot view of system state
○ Lookup in live system. Dumped to data lake.
○ Regular intervals (daily)
● Claims
○ Statement about the past
○ Time window scope
Time scopes
Domain time - scope of claim: “These are suspected fraudulent users for March 2019.”
(Diagram: registration time, event time, ingest time, and processing time collected along the event's path through a service into the lake)
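The four time scopes above could be modelled as explicit fields on an event record. This is a hypothetical sketch; the class and field names are mine, not from the deck:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ClickEvent:
    payload: str
    event_time: datetime         # when it happened (client clock)
    registration_time: datetime  # when a service recorded it
    ingest_time: datetime        # when it landed in the data lake
    processing_time: datetime    # when a batch job processed it
```

Keeping all four timestamps, rather than overwriting one with another, is what makes the later-discussed late-data and reprocessing patterns possible.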
Clocks
● Computer clocks measure elapsed time
● Good clocks, bad clocks, wrong clocks
(Diagram: controlled backend: good clocks, UTC; legacy systems: good clocks, uncertain time zone; clients: bad clocks, uncertain time zone)
Calendars, naive
Maps time to astronomical and social domains
Naive calendar
● 365 days + leap years
● 12 months, 28-31 days
● 7 day weeks
● 24 hours
● 60 minutes
● 60 seconds
● 24 hour time zones
Naive calendar properties
t + 1.day == t + 86400.seconds
(y + 45.years).asInt == y.asInt + 45
(date(y, 1, 1) + 364.days).month == 12
(t + 60.minutes).hour == (t.hour + 1) % 24
datetime(y, mon, d, h, m, s, tz).inZone(tz2).day in [d - 1, d, d + 1]
datetime(y, mon, d, h, m, s, tz).inZone(tz2).minute == m
date(y, 12, 29) + 1.day == date(y, 12, 30)
(date(y, 2, 28) + 2.days).month == 3
Can you find the counterexamples?
Calendar reality
t + 1.day == t + 86400.seconds
(y + 45.years).asInt == y.asInt + 45
(date(y, 1, 1) + 364.days).month == 12
(t + 60.minutes).hour == (t.hour + 1) % 24
datetime(y, mon, d, h, m, s, tz).inZone(tz2).day in [d - 1, d, d + 1]
datetime(y, mon, d, h, m, s, tz).inZone(tz2).minute == m
date(y, 12, 29) + 1.day == date(y, 12, 30)
(date(y, 2, 28) + 2.days).month == 3
Counterexamples:
● Leap seconds
● Year zero
● Julian / Gregorian calendar switch (Sweden: 1712-02-30)
● Daylight savings time
● Samoa crosses the date line. Again.
● Time zones: span 26 hours, with 15 minute granularity
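One counterexample, sketched with the standard-library zoneinfo module (Python 3.9+): across the spring DST transition, adding one calendar day is not the same as adding 86400 seconds.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

stockholm = ZoneInfo("Europe/Stockholm")
# Noon on the day before DST started in 2019 (clocks sprang forward March 31).
t = datetime(2019, 3, 30, 12, 0, tzinfo=stockholm)

one_day_later = t + timedelta(days=1)  # same wall-clock time, next day
elapsed = one_day_later.timestamp() - t.timestamp()
print(elapsed)  # 82800.0 -- only 23 hours elapsed, not 86400 seconds
```

Python's aware-datetime arithmetic adds wall-clock time, so the elapsed real time depends on whether the interval crosses a DST switch.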
Small problems in practice
All of the above are small problems in practice, except daylight savings time:
● Analytics quality
● Changes with political decisions
● Might affect technical systems
● (Affects health!)
Daylight savings example
Credits: Rachel Wigell, https://twitter.com/the_databae/status/1191479005192605698
Typical loading dock ingress
● File arrives every hour
● Ingest job copies to lake, applies data platform conventions
● Source system determines format, naming, and timestamps, incl. time zone
● Spring: halted ingest. Autumn: data loss.

class SalesArrived(ExternalTask):
    """Receive a file with sales transactions every hour."""
    hour = DateHourParameter()

    def matches(self):
        return [u for u in GCSClient().listdir('gs://ingress/sales/')
                if re.match(rf'{self.hour:%Y%m%d%H}.*\.json', u)]

    def output(self):
        return GCSTarget(self.matches()[0])

@requires(SalesArrived)
class SalesIngest(Task):
    def output(self):
        return GCSFlagTarget(
            f'gs://lake/sales/{self.hour:year=%Y/month=%m/day=%d/hour=%H}/')

    def run(self):
        _, base = self.input().path.rsplit('/', 1)
        dst = f'{self.output().path}{base}'
        self.output().fs.copy(self.input().path, dst)
        self.output().fs.put_string(
            '', f'{self.output().path}{self.output().flag}')

crontab:
*/10 * * * * luigi RangeHourly --of SalesIngest
Using snapshots
● join(event, snapshot) → always time mismatch
● Usually acceptable
○ In one direction
(Diagram: joining events with snapshots DB and DB': free users using premium service)
Window misalign
● Time mismatch in both directions
(Diagram: join window misaligned with the DB snapshot)
Event sourcing
● Every change to unified log == source of truth
● snapshot(t + 1) = sum(snapshot(t), events(t, t+1))
● Allows view & join at any point in time
○ But more complex
(Diagram: snapshots DB and DB' derived from the event log)
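The snapshot identity above can be sketched as a fold: apply one interval's change events to the previous snapshot to get the next one. The user/plan domain here is a made-up example:

```python
def apply_events(snapshot: dict, events: list) -> dict:
    """snapshot(t + 1) = sum(snapshot(t), events(t, t + 1))"""
    state = dict(snapshot)  # never mutate the input snapshot (immutability)
    for user_id, plan in events:
        state[user_id] = plan
    return state

day0 = {"alice": "free"}
day1 = apply_events(day0, [("alice", "premium"), ("bob", "free")])
print(day1)  # {'alice': 'premium', 'bob': 'free'}
```

Because each snapshot is derived deterministically from the log, you can materialise the state at any point in time for joins, at the cost of the extra machinery.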
Easier to express with streaming?
state + event → state'
● State is a view of sum of all events
○ Join with the sum table
○ Beautiful!
○ Not simple in practice
● Mixing event time and processing time
○ Every join is a race condition
Stream window operations
(Diagram: sliding windows, tumbling windows, and window joins over streams)
Window over what?
● Number of events
● Processing time
● Registration time
● Event time
Trade-offs by window basis:
● Less relevant for domain; deterministic; predictable semantics; predictable resources
● More relevant for domain; deterministic; unpredictable semantics; unpredictable resources
● Less relevant for domain; nondeterministic; unpredictable semantics; predictable resources
Batch is easier?
● Yes
● But, some pitfalls
Event ingest
How to divide events into datasets?
When to start processing?
Anti-pattern: Bucket by registration time
Ancient data collection:
● Events in log files, partitioned hourly
● Copy each hour of every host to lake
Are all hosts represented?
Anti-pattern: Reprocess
● Start processing optimistically
● Reprocess after x% new data has arrived
Supporting reprocessing
● Start processing optimistically
● Reprocess after x% new data has arrived
Solution space:
● Apache Beam / Google → stream processing ops
● Data versioning
● Provenance
Requires good (currently lacking) tooling
Event collection
(Diagram: registration time, event time, ingest time, and processing time collected as events pass through a service)
Pattern: Ingest time bucketing
● Bundle incoming events into datasets
○ Bucket on ingest / wall-clock time
○ Predictable bucketing, e.g. hour
● Sealed quickly at end of hour
● Mark dataset as complete
○ E.g. _SUCCESS flag
Example paths: clicks/2016/02/08/14, clicks/2016/02/08/15
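A minimal local-filesystem sketch of the pattern, with Path standing in for the lake (function names are hypothetical):

```python
from datetime import datetime, timezone
from pathlib import Path

def bucket_path(root: Path, ingest_time: datetime) -> Path:
    # Bucket on ingest / wall-clock time, hourly granularity.
    return root / ingest_time.strftime("%Y/%m/%d/%H")

def append_events(root: Path, events: list, now: datetime) -> Path:
    # Append incoming events to the current wall-clock-hour dataset.
    path = bucket_path(root, now)
    path.mkdir(parents=True, exist_ok=True)
    with open(path / "events.json", "a") as f:
        f.writelines(e + "\n" for e in events)
    return path

def seal(path: Path) -> None:
    # Seal the hour: mark the dataset complete with a _SUCCESS flag.
    (path / "_SUCCESS").touch()
```

Downstream jobs trigger on the flag, not on file presence, so a half-written hour is never consumed.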
When data is late
Incompleteness recovery

val orderLateCounter = longAccumulator("order-event-late")
val hourPaths = conf.order.split(",")
val order = hourPaths
  .map(spark.read.avro(_))
  .reduce((a, b) => a.union(b))
val orderThisHour = order
  .map({ cl =>
    // Count the events that came after the delay window
    if (cl.eventTime.hour + config.delayHours < config.hour) {
      orderLateCounter.add(1)
    }
    cl
  })
  .filter(cl => cl.eventTime.hour == config.hour)

class OrderShuffle(SparkSubmitTask):
    hour = DateHourParameter()
    delay_hours = IntParameter()
    jar = 'orderpipeline.jar'
    entry_class = 'com.example.shop.OrderShuffleJob'

    def requires(self):
        # Note: this delays processing by delay_hours hours.
        return [Order(hour=h) for h in
                [self.hour + timedelta(hours=n) for n in
                 range(self.delay_hours)]]

    def output(self):
        return HdfsTarget("/prod/red/order/v1/"
                          f"delay={self.delay_hours}/"
                          f"{self.hour:%Y/%m/%d/%H}/")

    def app_options(self):
        return ["--hour", self.hour,
                "--delay-hours", self.delay_hours,
                "--order",
                ",".join([i.path for i in self.input()]),
                "--output", self.output().path]
Fast data, complete data

class OrderShuffleAll(WrapperTask):
    hour = DateHourParameter()

    def requires(self):
        return [OrderShuffle(hour=self.hour, delay_hours=d)
                for d in [0, 4, 12]]

class OrderDashboard(mysql.CopyToTable):
    hour = DateHourParameter()

    def requires(self):
        return OrderShuffle(hour=self.hour, delay_hours=0)

class FinancialReport(SparkSubmitTask):
    date = DateParameter()

    def requires(self):
        return [OrderShuffle(
                    hour=datetime.combine(self.date, time(hour=h)),
                    delay_hours=12)
                for h in range(24)]

(Diagram: parallel pipelines with delays of 0, 4, and 12 hours)
Lambda in batch

class ReportBase(SparkSubmitTask):
    date = DateParameter()
    jar = 'reportpipeline.jar'
    entry_class = 'com.example.shop.ReportJob'

    def requires(self):
        return [OrderShuffle(
                    hour=datetime.combine(self.date, time(hour=h)),
                    delay_hours=self.delay)
                for h in range(24)]

class PreliminaryReport(ReportBase):
    delay = 0

class FinalReport(ReportBase):
    delay = 12

    def requires(self):
        return super().requires() + [SuspectedFraud(self.date)]
Anti-pattern: Reprocessing
Human-delayed data:
“These are my compensation claims for last January.”
● Want: accurate reporting
● Want: current report status
Dual time scopes
● What we knew about month X at date Y
● Deterministic
○ Can be backfilled, audited
● May require sparse dependency trees

class ClaimsReport(SparkSubmitTask):
    domain_month = MonthParameter()
    date = DateParameter()
    jar = 'reportpipeline.jar'
    entry_class = 'com.example.shop.ClaimsReportJob'

    def requires(self):
        return [Claims(domain_month=self.domain_month, date=d)
                for d in date_range(
                    self.domain_month + relativedelta(months=1),
                    self.date)]
Recursive dependencies
● Use yesterday’s dataset + some delta
○ Starting point necessary
● Often convenient, but operational risk
○ Slow backfills

class StockAggregateJob(SparkSubmitTask):
    date = DateParameter()
    jar = 'stockpipeline.jar'
    entry_class = 'com.example.shop.StockAggregate'

    def requires(self):
        yesterday = self.date - timedelta(days=1)
        previous = StockAggregateJob(date=yesterday)
        return [StockUpdate(date=self.date), previous]
Recursive dependency strides
● Mitigation: recursive strides
● y/m/1 depends on y/m-1/1
● Others depend on y/m/1 + all previous days in this month

class StockAggregateStrideJob(SparkSubmitTask):
    date = DateParameter()
    jar = 'stockpipeline.jar'
    entry_class = 'com.example.shop.StockAggregate'

    def requires(self):
        first_in_month = self.date.replace(day=1)
        base = (first_in_month - relativedelta(months=1)
                if self.date.day == 1 else first_in_month)
        return ([StockAggregateStrideJob(date=base)] +
                [StockUpdate(date=d) for d in
                 rrule(freq=DAILY, dtstart=base, until=self.date)])
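The stride schedule can be illustrated without the pipeline framework. This hypothetical helper computes the recursion base and the update days a given date depends on:

```python
from datetime import date, timedelta

def stride_deps(d: date):
    """Return (recursion base, list of update days) for the stride schedule."""
    first_in_month = d.replace(day=1)
    # The 1st of a month recurses on the 1st of the previous month;
    # all other days recurse on the 1st of their own month.
    if d.day == 1:
        base = (first_in_month - timedelta(days=1)).replace(day=1)
    else:
        base = first_in_month
    days = [base + timedelta(days=i) for i in range((d - base).days + 1)]
    return base, days

base, days = stride_deps(date(2019, 11, 7))
print(base, len(days))  # 2019-11-01 7
```

A backfill now recurses month by month instead of day by day, which bounds the depth of the dependency chain.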
Business logic with history
Example: Forming sessions
● Valuable for
○ User insights
○ Product insights
○ A/B testing
○ ...
What is a session?
● Sequence of clicks at most 5 minutes
apart?
● Maximum length 3 hours?
● In order to emit sessions in one hour,
which hours of clicks are needed?
What is a session?
● Sequence of clicks at most 5 minutes
apart.
● Maximum length 3 hours.
Examples, window = 5am - 9am:
● Clicks at [6:00, 6:01, 6:03, 6:45, 6:47]?
○ Two sessions, 3 and 2 minutes long.
● [5:00, 5:01, 5:03]?
○ One session, 3 minutes?
○ [(4:57), 5:00, 5:01, 5:03]?
○ [(2:01), (every 3 minutes), (4:57), 5:00,
5:01, 5:03]?
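The definition can be sketched as a plain batch function over sorted click timestamps (an illustrative sketch, not the talk's production code):

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(minutes=5)
MAX_LENGTH = timedelta(hours=3)

def sessions(clicks):
    """Split sorted click times into sessions: clicks at most 5 minutes
    apart, sessions at most 3 hours long."""
    result, current = [], []
    for t in sorted(clicks):
        if current and (t - current[-1] > MAX_GAP
                        or t - current[0] >= MAX_LENGTH):
            result.append(current)
            current = []
        current.append(t)
    if current:
        result.append(current)
    return result

day = datetime(2019, 11, 7)
clicks = [day.replace(hour=6, minute=m) for m in (0, 1, 3, 45, 47)]
print([len(s) for s in sessions(clicks)])  # [3, 2]
```

On the slide's example clicks this yields the two expected sessions. The hard part, as the [(4:57), 5:00, ...] cases show, is that the correct result can depend on clicks before the processing window, which no single-window batch can see.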
Occurrences with unbounded time spans
● E.g. sessions, behavioural patterns
● You often need a broader time range than expected
● You may need data from the future
● You may need infinite history
○ Recursive strides?
○ Introduce static limits, e.g. cut all sessions at midnight?
○ Emit counters to monitor assumptions.
Secrets of valuable data engineering
1. Avoid complexity and distributed systems
○ Your data fits in one machine.
2. Pick the slowest data integration you can live with
○ Batch >> streaming >> service.
○ Slow data → easy operations → high innovation speed
3. Functional architectural principles
○ Pipelines
○ Immutability
○ Reproducibility
○ Idempotency
4. Master workflow orchestration
Team concurrency - DataOps
● Pipelines
○ Parallel development without risk
● Immutability
○ Data reusability without risk
● Reproducibility, idempotency
○ Preserving immutability
○ Reduces operational risk
○ Stable experiments
● Workflow orchestration
○ Data handover between systems & teams
○ Reduces operational overhead
Data science
reproducibility crisis!
Resources, credits
Time libraries:
● Java: java.time (successor to Joda-Time)
● Scala: chronoscala
● Python: dateutil, pendulum
Presentation on operational tradeoffs:
https://www.slideshare.net/lallea/data-ops-in-practice
Data engineering resources:
https://www.scling.com/reading-list
Thank you,
● Konstantinos Chaidos, Spotify
● Lena Sundin, independent
Tech has massive impact on society
Product? Supplier? Employer? Cloud?
Make an active choice whether to have an impact!
Laptop sticker
Vintage data visualisations, by Karin Lind.
● Charles Minard: Napoleon’s Russian campaign of 1812. Drawn 1869.
● Matthew F Maury: Wind and Current Chart of the North Atlantic. Drawn 1852.
● Florence Nightingale: Causes of Mortality in the Army of the East. Crimean
war, 1854-1856. Drawn 1858.
○ Blue = disease, red = wounds, black = battle + other.
● Harold Craft: Radio Observations of the Pulse Profiles and Dispersion
Measures of Twelve Pulsars, 1970
○ Joy Division:
Unknown Pleasures, 1979
○ “Joy plot” → “ridge plot”
How to not kill people - Berlin Buzzwords 2023.pdf
 
The 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdf
 
Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application quality
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
 
Ai legal and ethics
Ai   legal and ethicsAi   legal and ethics
Ai legal and ethics
 
Big data == lean data
Big data == lean dataBig data == lean data
Big data == lean data
 
Privacy by design
Privacy by designPrivacy by design
Privacy by design
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
 

Recently uploaded

Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfSubhamKumar3239
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 

Recently uploaded (20)

Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdf
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 

Eventually, time will kill your data processing

www.scling.com
Data categories, time angle
● Facts
○ Events, observations
○ Time stamped by clock(s)
○ Continuous stream
● State
○ Snapshot view of system state
○ Lookup in live system. Dumped to data lake.
○ Regular intervals (daily)
● Claims
○ Statement about the past
○ Time window scope
14
www.scling.com
Time scopes
Domain time - scope of claim: “These are suspected fraudulent users for March 2019.”
Timeline labels (diagram): registration time, event time, ingest time, processing time
15
www.scling.com
Clocks
● Computer clocks measure elapsed time
● Good clocks, bad clocks, wrong clocks
● Controlled backend - good clocks, UTC
● Legacy systems - good clocks, but which time zone?
● Clients - bad clocks, and which time zone?
16
www.scling.com
Calendars, naive
Maps time to astronomical and social domains.
Naive calendar:
● 365 days + leap years
● 12 months, 28-31 days
● 7 day weeks
● 24 hours
● 60 minutes
● 60 seconds
● 24 hour time zones
17
www.scling.com
Naive calendar properties
t + 1.day == t + 86400.seconds
(y + 45.years).asInt == y.asInt + 45
(date(y, 1, 1) + 364.days).month == 12
(t + 60.minutes).hour == (t.hour + 1) % 24
datetime(y, mon, d, h, m, s, tz).inZone(tz2).day in [d - 1, d, d + 1]
datetime(y, mon, d, h, m, s, tz).inZone(tz2).minute == m
date(y, 12, 29) + 1.day == date(y, 12, 30)
(date(y, 2, 28) + 2.days).month == 3
Can you find the counter examples?
18
www.scling.com
Calendar reality
Each of the naive properties has a counterexample:
● Leap seconds
● Year zero
● Julian / Gregorian calendar switch (Sweden had 1712-02-30)
● Daylight savings time
● Samoa crosses the date line. Again.
● Time zones: span 26 hours, 15 minute granularity
19
www.scling.com
Small problems in practice
Leap seconds, year zero, the Julian / Gregorian switch, Samoa's date line jump, and exotic time zones are small problems in practice - except DST:
● Analytics quality
● Changes with political decision
● Might affect technical systems
● (Affects health!)
20
www.scling.com
Daylight savings example
Credits: Rachel Wigell, https://twitter.com/the_databae/status/1191479005192605698
21
www.scling.com
Typical loading dock ingress
● File arrives every hour
● Ingest job copies to lake, applies data platform conventions
● Source system determines format, naming, and timestamps

class SalesArrived(ExternalTask):
    """Receive a file with sales transactions every hour."""
    hour = DateHourParameter()

    def matches(self):
        return [u for u in GCSClient().listdir('gs://ingress/sales/')
                if re.match(rf'{self.hour:%Y%m%d%H}.*.json', u)]

    def output(self):
        return GCSTarget(self.matches()[0])


@requires(SalesArrived)
class SalesIngest(Task):
    def output(self):
        return GCSFlagTarget(
            f'gs://lake/sales/{self.hour:year=%Y/month=%m/day=%d/hour=%H}/')

    def run(self):
        _, base = self.input().path.rsplit('/', 1)
        dst = f'{self.output().path}{base}'
        self.output().fs.copy(self.input().path, dst)
        self.output().fs.put_string('', f'{self.output().path}{self.output().flag}')

crontab: */10 * * * * luigi RangeHourly --of SalesIngest
22
www.scling.com
Typical loading dock ingress
● File arrives every hour
● Ingest job copies to lake, applies data platform conventions
● Source system determines format, naming, and timestamps, incl. zone
● Spring DST switch: halted ingest. Autumn: data loss
(Same ingest code as the previous slide.)
23
www.scling.com
Using snapshots
● join(event, snapshot) → always time mismatch
● Usually acceptable
○ In one direction
24

www.scling.com
Using snapshots
● join(event, snapshot) → always time mismatch
● Usually acceptable
○ In one direction
● Counterexample: free users using premium service
25

www.scling.com
Window misalign
● Time mismatch in both directions
26
www.scling.com
Event sourcing
● Every change to unified log == source of truth
● snapshot(t + 1) = sum(snapshot(t), events(t, t+1))
● Allows view & join at any point in time
○ But more complex
27
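The snapshot recurrence can be sketched as a pure fold over the log (hypothetical account events, not from the talk):

```python
from functools import reduce

def apply_event(snapshot: dict, event: dict) -> dict:
    """snapshot(t+1) = sum(snapshot(t), one event) -- pure and replayable."""
    s = dict(snapshot)  # immutability: never mutate the old snapshot
    if event["type"] == "signup":
        s[event["user"]] = "free"
    elif event["type"] == "upgrade":
        s[event["user"]] = "premium"
    elif event["type"] == "churn":
        s.pop(event["user"], None)
    return s

def snapshot_at(events: list) -> dict:
    """Replay the unified log from the start to rebuild state at any point."""
    return reduce(apply_event, events, {})

log = [
    {"type": "signup", "user": "ada"},
    {"type": "signup", "user": "bob"},
    {"type": "upgrade", "user": "ada"},
    {"type": "churn", "user": "bob"},
]
print(snapshot_at(log))      # {'ada': 'premium'}
print(snapshot_at(log[:2]))  # view at an earlier point: {'ada': 'free', 'bob': 'free'}
```

Because the fold is deterministic, a view at any point in time is just a replay over a prefix of the log - the flexibility that plain snapshots lack, at the cost of the extra machinery.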
www.scling.com
Easier to express with streaming?
state + event → state'
● State is a view of the sum of all events
○ Join with the sum table
○ Beautiful!
○ Not simple in practice
● Mixing event time and processing time
○ Every join is a race condition
28

www.scling.com
Stream window operations
● Sliding windows
● Tumbling windows
● Window join
29
www.scling.com
Window over what?
● Number of events
○ Less relevant for domain. Deterministic. Predictable semantics. Predictable resources.
● Processing time
○ Less relevant for domain. Indeterministic. Unpredictable semantics. Predictable resources.
● Registration time / event time
○ More relevant for domain. Deterministic. Unpredictable semantics. Unpredictable resources.
30
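A minimal illustration of why event-time windows are deterministic: the same events in a different arrival order produce identical windows, only the moment of completeness changes (hypothetical click events, seconds since epoch):

```python
from collections import Counter

def tumbling_counts(events, window_seconds=3600):
    """Count events per event-time tumbling window. Late or out-of-order
    arrival does not change the result, only *when* it is complete."""
    counts = Counter()
    for e in events:
        window_start = e["event_time"] - e["event_time"] % window_seconds
        counts[window_start] += 1
    return dict(counts)

# Same events, delivered in different orders -> identical windows.
events = [{"event_time": t} for t in (3600, 3700, 7300, 50)]
assert tumbling_counts(events) == tumbling_counts(list(reversed(events)))
print(tumbling_counts(events))  # {3600: 2, 7200: 1, 0: 1}
```

A processing-time window, by contrast, would bucket these events by when the job saw them, so two runs over the same data could disagree - deterministic semantics are traded for predictable resource usage.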
www.scling.com
Batch is easier?
● Yes
● But, some pitfalls
Event ingest:
● How to divide events into datasets?
● When to start processing?
31

www.scling.com
Anti-pattern: Bucket by registration time
Ancient data collection:
● Events in log files, partitioned hourly
● Copy each hour of every host to lake
● Are all hosts represented before processing starts?
32

www.scling.com
Anti-pattern: Reprocess
● Start processing optimistically
● Reprocess after x% new data has arrived
33
www.scling.com
Supporting reprocessing
● Start processing optimistically
● Reprocess after x% new data has arrived
Solution space:
● Apache Beam / Google → stream processing ops
● Data versioning
● Provenance
Requires good (currently lacking) tooling.
34
www.scling.com
Pattern: Ingest time bucketing
● Bundle incoming events into datasets
○ Bucket on ingest / wall-clock time
○ Predictable bucketing, e.g. hour: clicks/2016/02/08/14, clicks/2016/02/08/15
● Sealed quickly at end of hour
● Mark dataset as complete
○ E.g. _SUCCESS flag
36
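The bucketing convention can be sketched as a pure function of the ingest wall-clock (path layout taken from the slide; the function name is hypothetical):

```python
from datetime import datetime, timezone

def ingest_bucket(ingest_time: datetime, prefix: str = "clicks") -> str:
    """Events are bucketed by when they *arrive*, not by their event time,
    so the current hour's dataset can be sealed as soon as the hour ends."""
    return f"{prefix}/{ingest_time:%Y/%m/%d/%H}"

# An event arriving at 14:37 UTC lands in the 14:00 bucket:
print(ingest_bucket(datetime(2016, 2, 8, 14, 37, tzinfo=timezone.utc)))
# clicks/2016/02/08/14
```

Using the ingest clock - a single clock the platform controls, in UTC - is what makes the bucket boundary predictable and the dataset sealable, unlike registration-time buckets that depend on every source host showing up.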
www.scling.com
Incompleteness recovery

val orderLateCounter = longAccumulator("order-event-late")
val hourPaths = conf.order.split(",")
val order = hourPaths
  .map(spark.read.avro(_))
  .reduce((a, b) => a.union(b))
val orderThisHour = order
  .map({ cl =>
    // Count the events that came after the delay window
    if (cl.eventTime.hour + config.delayHours < config.hour) {
      orderLateCounter.add(1)
    }
    cl
  })
  .filter(cl => cl.eventTime.hour == config.hour)

class OrderShuffle(SparkSubmitTask):
    hour = DateHourParameter()
    delay_hours = IntParameter()
    jar = 'orderpipeline.jar'
    entry_class = 'com.example.shop.OrderShuffleJob'

    def requires(self):
        # Note: This delays processing by N hours.
        return [Order(hour=hour) for hour in
                [self.hour + timedelta(hours=h)
                 for h in range(self.delay_hours)]]

    def output(self):
        return HdfsTarget("/prod/red/order/v1/"
                          f"delay={self.delay_hours}/"
                          f"{self.hour:%Y/%m/%d/%H}/")

    def app_options(self):
        return ["--hour", self.hour,
                "--delay-hours", self.delay_hours,
                "--order", ",".join([i.path for i in self.input()]),
                "--output", self.output().path]
38
www.scling.com
Fast data, complete data

class OrderShuffleAll(WrapperTask):
    hour = DateHourParameter()

    def requires(self):
        return [OrderShuffle(hour=self.hour, delay_hours=d)
                for d in [0, 4, 12]]


class OrderDashboard(mysql.CopyToTable):
    hour = DateHourParameter()

    def requires(self):
        return OrderShuffle(hour=self.hour, delay_hours=0)


class FinancialReport(SparkSubmitTask):
    date = DateParameter()

    def requires(self):
        return [OrderShuffle(
            hour=datetime.combine(self.date, time(hour=h)),
            delay_hours=12) for h in range(24)]
39
www.scling.com
Lambda in batch

class ReportBase(SparkSubmitTask):
    date = DateParameter()
    jar = 'reportpipeline.jar'
    entry_class = 'com.example.shop.ReportJob'

    def requires(self):
        return [OrderShuffle(
            hour=datetime.combine(self.date, time(hour=h)),
            delay_hours=self.delay) for h in range(24)]


class PreliminaryReport(ReportBase):
    delay = 0


class FinalReport(ReportBase):
    delay = 12

    def requires(self):
        return super().requires() + [SuspectedFraud(self.date)]
40
www.scling.com
Anti-pattern: Reprocessing
Human-delayed data: “These are my compensation claims for last January.”
● Want: accurate reporting
● Want: current report status
41
www.scling.com
Dual time scopes
● What we knew about month X at date Y
● Deterministic
○ Can be backfilled, audited
● May require sparse dependency trees

class ClaimsReport(SparkSubmitTask):
    domain_month = MonthParameter()
    date = DateParameter()
    jar = 'reportpipeline.jar'
    entry_class = 'com.example.shop.ClaimsReportJob'

    def requires(self):
        return [Claims(domain_month=self.domain_month, date=d)
                for d in date_range(
                    self.domain_month + relativedelta(months=1), self.date)]
42
www.scling.com
Recursive dependencies
● Use yesterday's dataset + some delta
○ Starting point necessary
● Often convenient, but operational risk
○ Slow backfills

class StockAggregateJob(SparkSubmitTask):
    date = DateParameter()
    jar = 'stockpipeline.jar'
    entry_class = 'com.example.shop.StockAggregate'

    def requires(self):
        yesterday = self.date - timedelta(days=1)
        previous = StockAggregateJob(date=yesterday)
        return [StockUpdate(date=self.date), previous]
43
www.scling.com
Recursive dependency strides
● Mitigation: recursive strides
● y/m/1 depends on y/m-1/1
● Others depend on y/m/1 + all previous days in this month

class StockAggregateStrideJob(SparkSubmitTask):
    date = DateParameter()
    jar = 'stockpipeline.jar'
    entry_class = 'com.example.shop.StockAggregate'

    def requires(self):
        first_in_month = self.date.replace(day=1)
        base = (first_in_month - relativedelta(months=1)
                if self.date.day == 1 else first_in_month)
        return ([StockAggregateStrideJob(date=base)] +
                [StockUpdate(date=d)
                 for d in rrule(freq=DAILY, dtstart=base, until=self.date)])
44
www.scling.com
Business logic with history
Example: Forming sessions
● Valuable for
○ User insights
○ Product insights
○ A/B testing
○ ...
45
www.scling.com
What is a session?
● Sequence of clicks at most 5 minutes apart.
● Maximum length 3 hours.
● In order to emit sessions in one hour, which hours of clicks are needed?
Examples, window = 5am - 9am:
● Clicks at [6:00, 6:01, 6:03, 6:45, 6:47]?
○ Two sessions, 3 and 2 minutes long.
● [5:00, 5:01, 5:03]?
○ One session, 3 minutes?
○ What if there was also a click at 4:57? [(4:57), 5:00, 5:01, 5:03]
○ Or a chain reaching far before the window? [(2:01), (every 3 minutes), (4:57), 5:00, 5:01, 5:03]
46-50
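The gap-and-cap session definition above can be sketched as a plain batch function (an illustrative helper, not the talk's production code); note that correctness near window edges still depends on how many hours of clicks you feed it:

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=5)
MAX_LEN = timedelta(hours=3)

def sessions(clicks):
    """Group click timestamps into sessions: consecutive clicks at most
    5 minutes apart, and a session is cut when it would exceed 3 hours."""
    out = []
    current = []
    for t in sorted(clicks):
        if current and (t - current[-1] > GAP or t - current[0] > MAX_LEN):
            out.append(current)
            current = []
        current.append(t)
    if current:
        out.append(current)
    return out

# The slide's first example: clicks at 6:00, 6:01, 6:03, 6:45, 6:47.
d = datetime(2019, 11, 7)
clicks = [d + timedelta(hours=6, minutes=m) for m in (0, 1, 3, 45, 47)]
print([len(s) for s in sessions(clicks)])  # [3, 2] -- two sessions
```

The function itself is trivial; the hard part, as the slide shows, is that a session starting at 5:00 may be the tail of a chain beginning hours earlier, so the input range - not the algorithm - determines correctness.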
www.scling.com
Occurrences with unbounded time spans
● E.g. sessions, behavioural patterns
● You often need a broader time range than expected
● You may need data from the future
● You may need infinite history
○ Recursive strides?
○ Introduce static limits, e.g. cut all sessions at midnight?
○ Emit counters to monitor assumptions.
51
www.scling.com
Secrets of valuable data engineering
1. Avoid complexity and distributed systems
○ Your data fits in one machine.
2. Pick the slowest data integration you can live with
○ Batch >> streaming >> service.
○ Slow data → easy operations → high innovation speed
3. Functional architectural principles
○ Pipelines
○ Immutability
○ Reproducibility
○ Idempotency
4. Master workflow orchestration
52
www.scling.com
Team concurrency - DataOps
● Pipelines
○ Parallel development without risk
● Immutability
○ Data reusability without risk
● Reproducibility, idempotency
○ Preserving immutability
○ Reduces operational risk
○ Stable experiments (cf. the data science reproducibility crisis!)
● Workflow orchestration
○ Data handover between systems & teams
○ Reduces operational overhead
53
www.scling.com
Resources, credits
Time libraries:
● Java: Joda time → java.time
● Scala: chronoscala
● Python: dateutil, pendulum
Presentation on operational tradeoffs: https://www.slideshare.net/lallea/data-ops-in-practice
Data engineering resources: https://www.scling.com/reading-list
Thank you:
● Konstantinos Chaidos, Spotify
● Lena Sundin, independent
54
www.scling.com
Tech has massive impact on society
Product? Supplier? Employer? Cloud?
Make an active choice whether to have an impact!
55
www.scling.com
Laptop sticker
Vintage data visualisations, by Karin Lind:
● Charles Minard: Napoleon's Russian campaign of 1812. Drawn 1869.
● Matthew F Maury: Wind and Current Chart of the North Atlantic. Drawn 1852.
● Florence Nightingale: Causes of Mortality in the Army of the East. Crimean war, 1854-1856. Drawn 1858.
○ Blue = disease, red = wounds, black = battle + other.
● Harold Craft: Radio Observations of the Pulse Profiles and Dispersion Measures of Twelve Pulsars, 1970.
○ Joy Division: Unknown Pleasures, 1979
○ “Joy plot” → “ridge plot”
56