The COVID-19 pandemic has spotlighted, as never before, the many shortcomings of the world’s data management workflows. The lack of established ways to exchange and access data was a widely recognized contributing factor in our poor response to the pandemic. On multiple occasions we have witnessed how poor practices around reproducibility and provenance completely sidetracked major vaccine research efforts, prompting many calls for action from the scientific and medical communities to address these problems.
Building a Distributed Collaborative Data Pipeline with Apache Spark
1. Building a Distributed Collaborative Data Pipeline with Apache Spark
Sergii Mikhtoniuk
Founder @ Kamu Data Inc.
2. Example: Hydroxychloroquine Study
Published in "The Lancet" journal in May
Data:
⬝ > 96,000 COVID-19 patients
⬝ 671 hospitals worldwide
⬝ Proprietary database: Surgisphere
Finding: Increased risk of in-hospital mortality
Global trials are halted completely
* https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)31180-6/fulltext
3. 3
Example: Hydroxychloroquine Study (cont.)
Publication retracted in June
⬝ After numerous data inconsistencies were pointed out
⬝ Data provenance could not be established
⬝ Surgisphere database goes offline
A data management issue
⬝ Was ignored for years
⬝ Derailed life-saving efforts at a critical time
Just one of many examples…
4. Reproducibility Crisis
Reproducibility is the foundation of the scientific method
Nearly impossible to achieve in practice
The problem originates in our data management practices
* https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970
6. Two Kinds of Data
Source Data
⬝ Generated by some system, or a result of observations
⬝ Cannot be reconstructed if lost
⬝ Publishers have full authority and are responsible for validity
Derivative Data
⬝ Created from other data via some transformations (aggregates, summaries, ML etc.)
⬝ Can be reconstructed if lost
⬝ Validity depends on source data
7. Source Data: Expectation
Two unrelated parties at different times should be able to access the same data and validate that it comes unaltered from the trusted source
This requires:
⬝ Stable References
⬝ Validation Mechanism
8. Source Data: Reality
Non-Temporal Datasets
⬝ Store the "current state" of the domain
⬝ Destructive in-place updates
⬝ Constant loss of history
Temporal Datasets
⬝ In-place updates
⬝ Same URL - different data
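A minimal SQL illustration of the difference, using hypothetical table and column names that do not appear in the talk: a non-temporal dataset is maintained with destructive updates, while a temporal dataset only ever appends new observations, so history is preserved.

-- Non-temporal: destructive in-place update; the previous value is lost forever
UPDATE covid_cases
SET total_cases = 1250
WHERE province = 'BC';

-- Temporal: append-only events; every observation and its timing are preserved
INSERT INTO covid_case_events (province, event_time, system_time, total_cases)
VALUES ('BC', TIMESTAMP '2020-09-01 00:00:00', CURRENT_TIMESTAMP, 1250);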
9. Derivative Data: Expectation
⬝ Determinism - repeating all transformation steps produces the same results as the original
⬝ Transparency - having the data, any party can verify that it was produced by the declared transformations without being accidentally or maliciously altered
10. Derivative Data: Reality
Reproducibility is a very manual process (workflows, compliance)
Requires:
⬝ Stable references to source data
⬝ Reproducible environments
⬝ Code, libraries, transitive dependencies, OS, hardware
⬝ Self-contained projects
⬝ No external API calls
⬝ Eliminating randomness
When time is pressing, all of this goes out the window
11. Examples
Sharing data on Kaggle or GitHub
⬝ Goal: Collaborating on data cleaning
⬝ Reality: Cannot be trusted
Data hubs & portals
⬝ Goal: improve discoverability & break down silos
⬝ Reality: Validity cannot be established
Copy + Version approach
⬝ Works only within the enterprise "bubble"
⬝ Doesn’t work between independent parties
12. Summary
We are mismanaging data on the most fundamental level
Collaboration on data is impossible - forward progress is constantly lost
We have reached the limit of workflow-based fixes - a technical solution is needed
13. Is There a Better Way?
We set out to design a new kind of data supply chain, built from the ground up for:
⬝ Reproducibility & verifiability
⬝ Low latency
⬝ Complete provenance
⬝ Data reuse and collaboration
15. Introducing Open Data Fabric
ODF is a protocol specification for reliable data exchange and transformation
The world’s first P2P data pipeline!
16. Strict Definition of "Data"
There is no such thing as "current state" - only history
⬝ Data is temporal
History doesn’t change
⬝ Data is immutable
Future decision making relies on history
⬝ Data must have infinite retention
Time is relative, information propagates with a delay
⬝ Use bitemporality to reflect that
17. Data Model
Dataset - a potentially infinite stream of events / observations
Every event has
⬝ Event time - when it occurred in the outside world
⬝ System time - when it was first observed by the system (monotonic)
Stable Ref = (dataset_id, system_time)
Validity = (stable_ref, hash)
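A small SQL sketch of what such a stable reference means in practice (continuing the hypothetical covid_case_events table from above): because events are immutable and system_time is assigned monotonically, any two parties filtering the stream at the same system_time cutoff obtain exactly the same records, which can then be pinned down with a hash.

-- Stable Ref = (dataset_id, system_time): everything recorded up to the cutoff
SELECT province, event_time, total_cases
FROM covid_case_events
WHERE system_time <= TIMESTAMP '2020-09-01 00:00:00'
ORDER BY system_time, event_time;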
20. Temporal Metadata
Metadata is a series of events
⬝ Stores everything that has ever influenced the data
Key enabler of:
⬝ Reproducibility
⬝ Verifiability
⬝ Dataset evolution
⬝ Efficient data sharing
⬝ Fine-grained provenance (WIP)
23. Goodbye Batch
Batch processing is unfit for purpose!
⬝ Relies on simplicity of non-temporal data
⬝ Ignores temporal problems
⬝ Late / out-of-order arrivals
⬝ Backfills & corrections
⬝ Arrival cadence differences in joins
Using batch would be
⬝ Extremely error-prone
⬝ Impossible to audit
24. Hello Stream Processing!
Designed for temporal problems
⬝ Event-time processing (bitemporality)
⬝ Watermarks, Triggers (sketched below)
⬝ Windowing
⬝ Projections
⬝ Stream-to-Stream Joins
⬝ Projection (Temporal-Table) Joins
Useful only for near real-time processing?
⬝ A common misconception!
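To make the handling of late and out-of-order arrivals concrete, here is a minimal Flink-style SQL sketch (the column names reuse the query on the next slide; the connector choice is illustrative): the watermark declaration tells the engine how long to wait for straggling events before considering an event-time window complete.

-- Illustrative event-time source; the built-in datagen connector is used only
-- so the sketch is self-contained and runnable without external infrastructure.
CREATE TABLE order_shipments (
    order_id          BIGINT,
    shipped_quantity  INT,
    order_time        TIMESTAMP(3),
    -- Tolerate events arriving up to 1 hour late before closing windows
    WATERMARK FOR order_time AS order_time - INTERVAL '1' HOUR
) WITH (
    'connector' = 'datagen'
);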
25. Stream Processing as the Primary Transformation Method
Pros
⬝ Agnostic of how and how often data arrives
⬝ Declarative and expressive - easier to audit
⬝ Better for determinism and reproducibility
⬝ Minimal latency
Cons
⬝ Unfamiliarity
⬝ Limited framework support
Example - a tumbling-window aggregation in streaming SQL:
SELECT
TUMBLE_ROWTIME(order_time, INTERVAL '1' DAY),
order_id,
count(*) as num_shipments,
sum(shipped_quantity) as shipped_total
FROM order_shipments
GROUP BY TUMBLE(order_time, INTERVAL '1' DAY), order_id
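The projection (temporal-table) joins listed on the previous slide can be sketched in the same Flink-style SQL. This assumes a hypothetical currency column on the shipments stream and a versioned currency_rates table (with a primary key and a watermark), neither of which appears in the talk; the point is that each event is joined against the dimension row valid at the event's own time, which keeps the result deterministic and reproducible.

-- Enrich each shipment with the exchange rate in effect at the event's time
SELECT
    s.order_id,
    s.shipped_quantity,
    s.order_time,
    r.rate AS usd_rate
FROM order_shipments AS s
JOIN currency_rates FOR SYSTEM_TIME AS OF s.order_time AS r
    ON s.currency = r.currency;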
27. Data Sharing Implications
Data storage
⬝ Root - durable and highly available (e.g. S3, GS, HDFS)
⬝ Derivative - cheap, unreliable, or none at all
⬝ A form of caching
⬝ Can be fully reconstructed from metadata and root data
Metadata
⬝ A digital passport of data
⬝ Used for verifying data integrity
⬝ Validating work of another peer to spot malicious activity
⬝ Exchanged securely between peers
34. ODF Summary
Data pipeline designed around properties essential for collaboration
⬝ Addresses decades of stagnation in data management
A much stricter model
⬝ No API calls, no "black box" transformations
Requires a mindset shift
⬝ Redefining what data is and how we treat it
⬝ Embracing temporal problems
35. The Next Level of Data Democratization
One of the pillars of the Digital Democracy and the next-generation decentralized IT
Supply of factual data for blockchain contracts
⬝ With no central authority
⬝ Resistant to malicious behavior
Foundation for data monetization
⬝ Publisher incentives
Data provider for next generation web & apps
37. Call for Action
Data is our modern-age history book - it should be treated as such
⬝ Stop modifying and copying data
⬝ Don’t version data - use bitemporal modelling instead
Data publishers have to take ownership of reproducibility
⬝ Need good standards and tools
Non-temporal data is a local optimum
⬝ Temporal data should be the default choice
Stream processing will displace batch
⬝ Let’s collaborate on improving tools, not on oversimplifying problems!