The COVID-19 pandemic has spotlighted, as never before, the many shortcomings of the world’s data management workflows. The lack of established ways to exchange and access data was a widely recognized contributing factor in our poor response to the pandemic. On multiple occasions we have witnessed how poor practices around reproducibility and provenance completely sidetracked major vaccine research efforts, prompting many calls for action from the scientific and medical communities to address these problems.
Building a Distributed Collaborative Data Pipeline with Apache Spark
1. Building a Distributed Collaborative Data Pipeline with Apache Spark
Sergii Mikhtoniuk
Founder @ Kamu Data Inc.
2. Example: Hydroxychloroquine Study
Published in "The Lancet" journal in May
Data:
⬝ > 96,000 COVID-19 patients
⬝ 671 hospitals worldwide
⬝ Proprietary database: Surgisphere
Finding: Increased risk of in-hospital mortality
Global trials are halted completely
* https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)31180-6/fulltext
3. 3
Example: Hydroxychloroquine Study (cont.)
Publication retracted in June
⬝ After numerous data inconsistencies were pointed out
⬝ Data provenance could not be established
⬝ Surgisphere database goes offline
A data management issue
⬝ Was ignored for years
⬝ Derailed life-saving efforts at a critical time
Just one of many examples…
4. Reproducibility Crisis
Reproducibility is the foundation of the scientific method
Nearly impossible to achieve in practice
The problem originates in our data management practices
* https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970
6. Two Kinds of Data
Source Data
⬝ Generated by some system, or a result of observations
⬝ Cannot be reconstructed if lost
⬝ Publishers have full authority and are responsible for validity
Derivative Data
⬝ Created from other data via some transformations (aggregates, summaries, ML etc.)
⬝ Can be reconstructed if lost
⬝ Validity depends on source data
7. Source Data: Expectation
Two unrelated parties at different times should be able to access the same data and validate that it comes unaltered from the trusted source
This requires:
⬝ Stable References
⬝ Validation Mechanism
8. Source Data: Reality
Non-Temporal Datasets
⬝ Store the "current state" of the domain
⬝ Destructive in-place updates
⬝ Constant loss of history
Temporal Datasets
⬝ In-place updates
⬝ Same URL - different data
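A minimal SQL illustration of the difference, using hypothetical table and column names that do not appear in the talk: a non-temporal dataset is maintained with destructive updates, while a temporal dataset only ever appends new observations, so history is preserved.

-- Non-temporal: destructive in-place update; the previous value is lost forever
UPDATE covid_cases
SET total_cases = 1250
WHERE province = 'BC';

-- Temporal: append-only events; every observation and its timing are preserved
INSERT INTO covid_case_events (province, event_time, system_time, total_cases)
VALUES ('BC', TIMESTAMP '2020-09-01 00:00:00', CURRENT_TIMESTAMP, 1250);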
9. Derivative Data: Expectation
⬝ Determinism - repeating all transformation steps produces the same results as the original
⬝ Transparency - having the data, any party can verify that it was produced by the declared transformations without being accidentally or maliciously altered
10. Derivative Data: Reality
Reproducibility is a very manual process (workflows, compliance)
Requires:
⬝ Stable references to source data
⬝ Reproducible environments
⬝ Code, libraries, transitive dependencies, OS, hardware
⬝ Self-contained projects
⬝ No external API calls
⬝ Eliminating randomness
When time is pressing, all of this goes out the window
11. Examples
Sharing data on Kaggle or GitHub
⬝ Goal: Collaborating on data cleaning
⬝ Reality: Cannot be trusted
Data hubs & portals
⬝ Goal: improve discoverability & break down silos
⬝ Reality: Validity cannot be established
Copy + Version approach
⬝ Works only within the enterprise "bubble"
⬝ Doesn’t work between independent parties
12. Summary
We are mismanaging data on the most fundamental level
Collaboration on data is impossible - forward progress is constantly lost
We have reached the limit of workflow-based fixes - a technical solution is needed
13. Is There a Better Way?
We set out to design a new kind of data supply chain, built from the ground up for:
⬝ Reproducibility & verifiability
⬝ Low latency
⬝ Complete provenance
⬝ Data reuse and collaboration
15. Introducing Open Data Fabric
ODF is a protocol specification for reliable data exchange and transformation
The world’s first P2P data pipeline!
16. Strict Definition of "Data"
There is no such thing as "current state" - only history
⬝ Data is temporal
History doesn’t change
⬝ Data is immutable
Future decision making relies on history
⬝ Data must have infinite retention
Time is relative, information propagates with a delay
⬝ Use bitemporality to reflect that
17. Data Model
Dataset - a potentially infinite stream of events / observations
Every event has
⬝ Event time - when it occurred in the outside world
⬝ System time - when it was first observed by the system (monotonic)
Stable Ref = (dataset_id, system_time)
Validity = (stable_ref, hash)
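A small SQL sketch of what such a stable reference means in practice (continuing the hypothetical covid_case_events table from above): because events are immutable and system_time is assigned monotonically, any two parties filtering the stream at the same system_time cutoff obtain exactly the same records, which can then be pinned down with a hash.

-- Stable Ref = (dataset_id, system_time): everything recorded up to the cutoff
SELECT province, event_time, total_cases
FROM covid_case_events
WHERE system_time <= TIMESTAMP '2020-09-01 00:00:00'
ORDER BY system_time, event_time;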
20. Temporal Metadata
Metadata is a series of events
⬝ Stores everything that has ever influenced the data
Key enabler of:
⬝ Reproducibility
⬝ Verifiability
⬝ Dataset evolution
⬝ Efficient data sharing
⬝ Fine-grained provenance (WIP)
23. Goodbye Batch
Batch processing is unfit for purpose!
⬝ Relies on simplicity of non-temporal data
⬝ Ignores temporal problems
⬝ Late / out-of-order arrivals
⬝ Backfills & corrections
⬝ Arrival cadence differences in joins
Using batch would be
⬝ Extremely error-prone
⬝ Impossible to audit
24. Hello Stream Processing!
Designed for temporal problems
⬝ Event-time processing (bitemporality)
⬝ Watermarks, Triggers (sketched below)
⬝ Windowing
⬝ Projections
⬝ Stream-to-Stream Joins
⬝ Projection (Temporal-Table) Joins
Useful only for near real-time processing?
⬝ A common misconception!
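To make the handling of late and out-of-order arrivals concrete, here is a minimal Flink-style SQL sketch (the column names reuse the query on the next slide; the connector choice is illustrative): the watermark declaration tells the engine how long to wait for straggling events before considering an event-time window complete.

-- Illustrative event-time source; the built-in datagen connector is used only
-- so the sketch is self-contained and runnable without external infrastructure.
CREATE TABLE order_shipments (
    order_id          BIGINT,
    shipped_quantity  INT,
    order_time        TIMESTAMP(3),
    -- Tolerate events arriving up to 1 hour late before closing windows
    WATERMARK FOR order_time AS order_time - INTERVAL '1' HOUR
) WITH (
    'connector' = 'datagen'
);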
25. Stream Processing as the Primary Transformation Method
Pros
⬝ Agnostic of how and how often data arrives
⬝ Declarative and expressive - easier to audit
⬝ Better for determinism and reproducibility
⬝ Minimal latency
Cons
⬝ Unfamiliarity
⬝ Limited framework support
Example - a tumbling-window aggregation in streaming SQL:
SELECT
TUMBLE_ROWTIME(order_time, INTERVAL '1' DAY),
order_id,
count(*) as num_shipments,
sum(shipped_quantity) as shipped_total
FROM order_shipments
GROUP BY TUMBLE(order_time, INTERVAL '1' DAY), order_id
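The projection (temporal-table) joins listed on the previous slide can be sketched in the same Flink-style SQL. This assumes a hypothetical currency column on the shipments stream and a versioned currency_rates table (with a primary key and a watermark), neither of which appears in the talk; the point is that each event is joined against the dimension row valid at the event's own time, which keeps the result deterministic and reproducible.

-- Enrich each shipment with the exchange rate in effect at the event's time
SELECT
    s.order_id,
    s.shipped_quantity,
    s.order_time,
    r.rate AS usd_rate
FROM order_shipments AS s
JOIN currency_rates FOR SYSTEM_TIME AS OF s.order_time AS r
    ON s.currency = r.currency;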
27. Data Sharing Implications
Data storage
⬝ Root - durable and highly available (e.g. S3, GS, HDFS)
⬝ Derivative - cheap, unreliable, or none at all
⬝ A form of caching
⬝ Can be fully reconstructed from metadata and root data
Metadata
⬝ A digital passport of data
⬝ Used for verifying data integrity
⬝ Validating work of another peer to spot malicious activity
⬝ Exchanged securely between peers
34. ODF Summary
Data pipeline designed around properties essential for collaboration
⬝ Addresses decades of stagnation in data management
A much stricter model
⬝ No API calls, no "black box" transformations
Requires a mindset shift
⬝ Redefining what data is and how we treat it
⬝ Embracing temporal problems
35. The Next Level of Data Democratization
One of the pillars of the Digital Democracy and the next-generation decentralized IT
Supply of factual data for blockchain contracts
⬝ With no central authority
⬝ Resistant to malicious behavior
Foundation for data monetization
⬝ Publisher incentives
Data provider for next generation web & apps
37. Call for Action
Data is our modern-age history book - it should be treated as such
⬝ Stop modifying and copying data
⬝ Don’t version data - use bitemporal modelling instead
Data publishers have to take ownership of reproducibility
⬝ Need good standards and tools
Non-temporal data is a local optimum
⬝ Temporal data should be the default choice
Stream processing will displace batch
⬝ Let’s collaborate on improving tools, not on oversimplifying problems!