Building a Distributed Collaborative Data Pipeline with Apache Spark
Sergii Mikhtoniuk
Founder @ Kamu Data Inc.
Example: Hydroxychloroquine Study
Published in "The Lancet" journal in May
Data:
⬝ > 96,000 COVID-19 patients
⬝ 671 hospitals worldwide
⬝ Proprietary database: Surgisphere
Finding: Increased risk of in-hospital mortality
Global trials are halted completely
* https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)31180-6/fulltext
Example: Hydroxychloroquine Study (cont.)
Publication retracted in June
⬝ After numerous data inconsistencies were pointed out
⬝ Data provenance could not be established
⬝ Surgisphere database goes offline
A data management issue
⬝ Was ignored for years
⬝ Derailed life-saving efforts at a critical time
Just one of many examples…
Reproducibility Crisis
Reproducibility is the foundation of the scientific method
Nearly impossible to achieve in practice
Problem originates in our data management practices
* https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970
Reproducibility & Verifiability
in Modern Data
Two Kinds of Data
Source Data
Generated by some system
⬝ Or a result of observations
Cannot be reconstructed if lost
Publishers
⬝ Have full authority
⬝ Responsible for validity
Derivative Data
Created from other data via some transformations
⬝ Aggregates, summaries, ML etc.
Can be reconstructed if lost
Validity depends on source data
Source Data: Expectation
Two unrelated parties at different times should be able to access the same data
and validate that it comes unaltered from the trusted source
Stable References
Validation Mechanism
Source Data: Reality
Non-Temporal Datasets
Store "current state" of the domain
Destructive in-place updates
Constant loss of history
Temporal Datasets
Store events / observations
In-place updates
Same URL - different data
Derivative Data: Expectation
Repeating all transformation steps produces the same results as the original.
Having data, any party can verify that it was produced by declared transformations
without being accidentally or maliciously altered.
Transparency
Determinism
Derivative Data: Reality
Reproducibility is a very manual process (workflows, compliance)
Requires:
⬝ Stable references to source data
⬝ Reproducible environments
⬝ Code, libraries, transitive dependencies, OS, hardware
⬝ Self-contained projects
⬝ No external API calls
⬝ Eliminating randomness
When time is pressing, it goes out the window
Examples
Sharing data on Kaggle or GitHub
⬝ Goal: Collaborating on data cleaning
⬝ Reality: Cannot be trusted
Data hubs & portals
⬝ Goal: improve discoverability & break down silos
⬝ Reality: Validity cannot be established
Copy + Version approach
⬝ Works only within the enterprise "bubble"
⬝ Doesn’t work between independent parties
Summary
We are mismanaging data on the most fundamental level
Collaboration on data is impossible - forward progress is constantly lost
Reached the limit of workflows - need a technical solution
Is There a Better Way?
We set out to design a new kind of data supply chain, built from the ground up for:
Reproducibility & verifiability
Low latency
Complete provenance
Data reuse and collaboration
Introducing Open Data Fabric
ODF is a protocol specification for reliable data exchange and transformation
The world’s first P2P data pipeline!
http://opendatafabric.org
Strict Definition of "Data"
There is no such thing as "current state" - only history
⬝ Data is temporal
History doesn’t change
⬝ Data is immutable
Future decision making relies on history
⬝ Data must have infinite retention
Time is relative, information propagates with a delay
⬝ Use bitemporality to reflect that
Data Model
Dataset - a potentially infinite stream of events / observations
Every event has
⬝ Event time - when it occurred in the outside world
⬝ System time - when it was first observed by the system (monotonic)
Stable Ref = (dataset_id, system_time)
Validity = (stable_ref, hash)
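The data model can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ODF's actual format: the class and function names (Event, Dataset, content_hash, stable_ref, validity) are hypothetical, and SHA-256 over canonical JSON stands in for whatever hashing the specification prescribes.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class Event:
    event_time: datetime   # when it occurred in the outside world
    system_time: datetime  # when the system first observed it
    payload: dict


@dataclass
class Dataset:
    dataset_id: str
    events: list = field(default_factory=list)

    def append(self, event_time, system_time, payload):
        # System time is monotonic: history is only ever appended to.
        if self.events and system_time < self.events[-1].system_time:
            raise ValueError("system_time must not go backwards")
        self.events.append(Event(event_time, system_time, payload))

    def as_of(self, system_time):
        # Everything the system had observed up to and including system_time.
        return [e for e in self.events if e.system_time <= system_time]


def content_hash(events):
    # Assumption: hash a canonical JSON encoding so equal data gives equal digests.
    canonical = json.dumps(
        [[e.event_time.isoformat(), e.system_time.isoformat(), e.payload] for e in events],
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def stable_ref(dataset, system_time):
    return (dataset.dataset_id, system_time)


def validity(dataset, system_time):
    return (stable_ref(dataset, system_time), content_hash(dataset.as_of(system_time)))
```

Because events are never updated in place, a late-arriving observation gets an older event time but a newer system time, so the slice behind an earlier stable reference, and therefore its hash, stays the same.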
Data Flow
Dataset Manifests
Root Dataset
Derivative Dataset
Temporal Metadata
Metadata is a series of events
⬝ Stores everything that has ever influenced the data
Key enabler of:
⬝ Reproducibility
⬝ Verifiability
⬝ Dataset evolution
⬝ Efficient data sharing
⬝ Fine-grained provenance (WIP)
Metadata Chain
Blockchain-like
Cryptographically secured
Linked to data
Extensible
⬝ Semantics / Ontology
⬝ Governance / Licensing
⬝ Privacy / Security
⬝ …
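A blockchain-like chain of this kind can be sketched as a list of blocks, each carrying the hash of its predecessor. This is a simplified illustration, not the ODF block format: the field names (prev_hash, event, data_hash) and the helper functions are assumptions made for the example.

```python
import hashlib
import json


def block_hash(block):
    # Hash the canonical JSON form of everything except the hash itself.
    body = {k: v for k, v in block.items() if k != "hash"}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_block(chain, event, data_hash=None):
    block = {
        "prev_hash": chain[-1]["hash"] if chain else None,
        "event": event,          # e.g. a schema change, a new transformation, new data
        "data_hash": data_hash,  # links the block to a slice of data, if any
    }
    block["hash"] = block_hash(block)
    chain.append(block)
    return block


def verify_chain(chain):
    # Every block must hash correctly and point at the block before it.
    prev = None
    for block in chain:
        if block["prev_hash"] != prev or block["hash"] != block_hash(block):
            return False
        prev = block["hash"]
    return True
```

Changing any earlier block changes its hash and breaks the link from the next block, so a rewrite of history is detectable by anyone holding the chain.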
Dataset Layout
Goodbye Batch
Batch processing is unfit for purpose!
⬝ Relies on simplicity of non-temporal data
⬝ Ignores temporal problems
⬝ Late / out-of-order arrivals
⬝ Backfills & corrections
⬝ Arrival cadence differences in joins
Using batch would be
⬝ Extremely error-prone
⬝ Impossible to audit
Hello Stream Processing!
Is designed for temporal problems
⬝ Event-time processing (bitemporality)
⬝ Watermarks, Triggers
⬝ Windowing
⬝ Projections
⬝ Stream-to-Stream Joins
⬝ Projection (Temporal-Table) Joins
Useful only for near real-time processing?
⬝ A common misconception!
Stream Processing as the Primary Transformation Method
Pros
Agnostic of how and how often data arrives
Declarative and expressive - easier to audit
Better for determinism and reproducibility
Minimal latency
Cons
Unfamiliarity
Limited framework support
SELECT
TUMBLE_ROWTIME(order_time, INTERVAL '1' DAY),
order_id,
count(*) as num_shipments,
sum(shipped_quantity) as shipped_total
FROM order_shipments
GROUP BY TUMBLE(order_time, INTERVAL '1' DAY), order_id
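To make the claim about being agnostic of how and how often data arrives concrete, here is a plain-Python sketch of the same daily tumbling-window aggregation. It is not how Spark or Flink execute the query: the DailyShipments class is hypothetical, windows are aligned to UTC days, and the watermark is simplified to "no rows older than this will arrive".

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

DAY = timedelta(days=1)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(event_time):
    # Align an event time to the start of its 1-day tumbling window (UTC).
    return EPOCH + ((event_time - EPOCH) // DAY) * DAY


class DailyShipments:
    def __init__(self):
        # (window_start, order_id) -> [num_shipments, shipped_total]
        self.state = defaultdict(lambda: [0, 0])

    def ingest(self, batch, watermark):
        """Add a batch of (order_time, order_id, shipped_quantity) rows, then
        emit every window that ends at or before the watermark."""
        for order_time, order_id, qty in batch:
            acc = self.state[(window_start(order_time), order_id)]
            acc[0] += 1
            acc[1] += qty
        closed = sorted(k for k in self.state if k[0] + DAY <= watermark)
        return [(w, oid, *self.state.pop((w, oid))) for w, oid in closed]
```

Feeding the same rows as one batch or as several out-of-order batches yields the same emitted windows, as long as the watermark assumption holds. Results depend on event time, not on arrival cadence.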
Execution Environment
ODF is framework-agnostic
Strict versioning + Sandboxing ensures reproducibility
Data Sharing Implications
Data storage
⬝ Root - durable and highly available (e.g. S3, GS, HDFS)
⬝ Derivative - cheap, unreliable, or none at all
⬝ A form of caching
⬝ Can be fully reconstructed from metadata and root data
Metadata
⬝ A digital passport of data
⬝ Used for verifying data integrity
⬝ Validating work of another peer to spot malicious activity
⬝ Exchanged securely between peers
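The idea that derivative data can be reconstructed and checked from metadata and root data can be sketched as follows. This is a toy model under stated assumptions: the TRANSFORMS registry, the metadata fields, and the publish/verify functions are hypothetical, and a real engine would run the declared query in a versioned sandbox rather than a Python lambda.

```python
import hashlib
import json


def digest(rows):
    # Canonical JSON so that equal data always produces the same digest.
    return hashlib.sha256(json.dumps(rows, sort_keys=True).encode("utf-8")).hexdigest()


# Hypothetical registry of named, deterministic transformations.
TRANSFORMS = {
    "daily_total": lambda rows: [
        {"day": day, "total": sum(r["qty"] for r in rows if r["day"] == day)}
        for day in sorted({r["day"] for r in rows})
    ],
}


def publish(root_rows, transform):
    # What a peer shares: metadata only; the derivative data itself is optional.
    return {
        "input_hash": digest(root_rows),
        "transform": transform,
        "output_hash": digest(TRANSFORMS[transform](root_rows)),
    }


def verify(root_rows, metadata):
    # Re-run the declared transformation on the same root data and compare hashes.
    if digest(root_rows) != metadata["input_hash"]:
        return False
    return digest(TRANSFORMS[metadata["transform"]](root_rows)) == metadata["output_hash"]
```

A peer that alters the output, or runs a different transformation than the one it declares, produces an output hash that another peer cannot reproduce.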
kamu-cli
https://github.com/kamu-data/kamu-cli
ODF coordinator
Single-binary app
⬝ Written in Rust!
Two prototype engines:
⬝ Apache Spark
⬝ Apache Flink
Convenience features:
⬝ SQL Shell
⬝ Notebooks
Alpha-quality but actively developed
kamu-cli
Git-like Interface
Spark SQL Shell
Notebooks
Conclusion
ODF Summary
Data pipeline designed around properties essential for collaboration
⬝ Addresses decades of stagnation in data management
A much stricter model
⬝ No API calls, no "black box" transformations
Requires a mindset shift
⬝ Redefining what data is and how we treat it
⬝ Embracing temporal problems
The Next Level of Data Democratization
One of the pillars of the Digital Democracy and the next-generation decentralized IT
Supply of factual data for blockchain contracts
⬝ With no central authority
⬝ Resistant to malicious behavior
Foundation for data monetization
⬝ Publisher incentives
Data provider for next generation web & apps
References
kamu-cli: https://github.com/kamu-data/kamu-cli/
Open Data Fabric: http://opendatafabric.org/
Blog: https://www.kamu.dev/blog/
Email: smikhtoniuk@kamu.dev
Call for Action
Data is our modern age history book - it should be treated as such
⬝ Stop modifying and copying data
⬝ Don’t version data - use bitemporal modelling instead
Data publishers have to take ownership of reproducibility
⬝ Need good standards and tools
Non-temporal data is a local optimum
⬝ Temporal data should be the default choice
Stream processing will displace batch
⬝ Let’s collaborate on improving tools, not on oversimplifying problems!

More Related Content

What's hot

Domain Driven Data: Apache Kafka® and the Data Mesh
Domain Driven Data: Apache Kafka® and the Data MeshDomain Driven Data: Apache Kafka® and the Data Mesh
Domain Driven Data: Apache Kafka® and the Data Mesh
confluent
 
Making the most of your Snowflake Investment
Making the most of your Snowflake InvestmentMaking the most of your Snowflake Investment
Making the most of your Snowflake Investment
Paul Van Siclen
 
Self-Service Analytics with Guard Rails
Self-Service Analytics with Guard RailsSelf-Service Analytics with Guard Rails
Self-Service Analytics with Guard Rails
Denodo
 
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
HostedbyConfluent
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Databricks
 
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
Denodo
 
Data Virtualization: The Agile Delivery Platform
Data Virtualization: The Agile Delivery PlatformData Virtualization: The Agile Delivery Platform
Data Virtualization: The Agile Delivery Platform
Denodo
 
Denodo DataFest 2016: Big Data Virtualization in the Cloud
Denodo DataFest 2016: Big Data Virtualization in the CloudDenodo DataFest 2016: Big Data Virtualization in the Cloud
Denodo DataFest 2016: Big Data Virtualization in the Cloud
Denodo
 
Fixing data science & Accelerating Artificial Super Intelligence Development
 Fixing data science & Accelerating Artificial Super Intelligence Development Fixing data science & Accelerating Artificial Super Intelligence Development
Fixing data science & Accelerating Artificial Super Intelligence Development
ManojKumarR41
 
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...
SoftServe
 
Cloud Modernization with Data Virtualization
Cloud Modernization with Data VirtualizationCloud Modernization with Data Virtualization
Cloud Modernization with Data Virtualization
Denodo
 
Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...
Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...
Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...
Denodo
 
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationMaximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Denodo
 
Denodo DataFest 2017: Outpace Your Competition with Real-Time Responses
Denodo DataFest 2017: Outpace Your Competition with Real-Time ResponsesDenodo DataFest 2017: Outpace Your Competition with Real-Time Responses
Denodo DataFest 2017: Outpace Your Competition with Real-Time Responses
Denodo
 
3 Reasons Data Virtualization Matters in Your Portfolio
3 Reasons Data Virtualization Matters in Your Portfolio3 Reasons Data Virtualization Matters in Your Portfolio
3 Reasons Data Virtualization Matters in Your Portfolio
Denodo
 
Agile Data Management with Enterprise Data Fabric (ASEAN)
Agile Data Management with Enterprise Data Fabric (ASEAN)Agile Data Management with Enterprise Data Fabric (ASEAN)
Agile Data Management with Enterprise Data Fabric (ASEAN)
Denodo
 
Embedding Insight through Prediction Driven Logistics
Embedding Insight through Prediction Driven LogisticsEmbedding Insight through Prediction Driven Logistics
Embedding Insight through Prediction Driven Logistics
Databricks
 
Data Mesh @ Yelp - 2019
Data Mesh @ Yelp - 2019Data Mesh @ Yelp - 2019
Data Mesh @ Yelp - 2019
Steven Moy
 
Data Virtualization: An Introduction
Data Virtualization: An IntroductionData Virtualization: An Introduction
Data Virtualization: An Introduction
Denodo
 
How OpenTable uses Big Data to impact growth by Raman Marya
How OpenTable uses Big Data to impact growth by Raman MaryaHow OpenTable uses Big Data to impact growth by Raman Marya
How OpenTable uses Big Data to impact growth by Raman Marya
Data Con LA
 

What's hot (20)

Domain Driven Data: Apache Kafka® and the Data Mesh
Domain Driven Data: Apache Kafka® and the Data MeshDomain Driven Data: Apache Kafka® and the Data Mesh
Domain Driven Data: Apache Kafka® and the Data Mesh
 
Making the most of your Snowflake Investment
Making the most of your Snowflake InvestmentMaking the most of your Snowflake Investment
Making the most of your Snowflake Investment
 
Self-Service Analytics with Guard Rails
Self-Service Analytics with Guard RailsSelf-Service Analytics with Guard Rails
Self-Service Analytics with Guard Rails
 
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
 
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
 
Data Virtualization: The Agile Delivery Platform
Data Virtualization: The Agile Delivery PlatformData Virtualization: The Agile Delivery Platform
Data Virtualization: The Agile Delivery Platform
 
Denodo DataFest 2016: Big Data Virtualization in the Cloud
Denodo DataFest 2016: Big Data Virtualization in the CloudDenodo DataFest 2016: Big Data Virtualization in the Cloud
Denodo DataFest 2016: Big Data Virtualization in the Cloud
 
Fixing data science & Accelerating Artificial Super Intelligence Development
 Fixing data science & Accelerating Artificial Super Intelligence Development Fixing data science & Accelerating Artificial Super Intelligence Development
Fixing data science & Accelerating Artificial Super Intelligence Development
 
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...
 
Cloud Modernization with Data Virtualization
Cloud Modernization with Data VirtualizationCloud Modernization with Data Virtualization
Cloud Modernization with Data Virtualization
 
Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...
Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...
Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...
 
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationMaximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
 
Denodo DataFest 2017: Outpace Your Competition with Real-Time Responses
Denodo DataFest 2017: Outpace Your Competition with Real-Time ResponsesDenodo DataFest 2017: Outpace Your Competition with Real-Time Responses
Denodo DataFest 2017: Outpace Your Competition with Real-Time Responses
 
3 Reasons Data Virtualization Matters in Your Portfolio
3 Reasons Data Virtualization Matters in Your Portfolio3 Reasons Data Virtualization Matters in Your Portfolio
3 Reasons Data Virtualization Matters in Your Portfolio
 
Agile Data Management with Enterprise Data Fabric (ASEAN)
Agile Data Management with Enterprise Data Fabric (ASEAN)Agile Data Management with Enterprise Data Fabric (ASEAN)
Agile Data Management with Enterprise Data Fabric (ASEAN)
 
Embedding Insight through Prediction Driven Logistics
Embedding Insight through Prediction Driven LogisticsEmbedding Insight through Prediction Driven Logistics
Embedding Insight through Prediction Driven Logistics
 
Data Mesh @ Yelp - 2019
Data Mesh @ Yelp - 2019Data Mesh @ Yelp - 2019
Data Mesh @ Yelp - 2019
 
Data Virtualization: An Introduction
Data Virtualization: An IntroductionData Virtualization: An Introduction
Data Virtualization: An Introduction
 
How OpenTable uses Big Data to impact growth by Raman Marya
How OpenTable uses Big Data to impact growth by Raman MaryaHow OpenTable uses Big Data to impact growth by Raman Marya
How OpenTable uses Big Data to impact growth by Raman Marya
 

Similar to Building a Distributed Collaborative Data Pipeline with Apache Spark

Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
Denodo
 
Future of Data Strategy
Future of Data StrategyFuture of Data Strategy
Future of Data Strategy
Denodo
 
Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)
Denodo
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
HostedbyConfluent
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
Denodo
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
Dr. Shikha Mehta
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
markgrover
 
Burton - Security, Privacy and Trust
Burton - Security, Privacy and TrustBurton - Security, Privacy and Trust
Burton - Security, Privacy and Trust
National Information Standards Organization (NISO)
 
The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!
DataWorks Summit/Hadoop Summit
 
Kostas Tzoumas - Stream Processing with Apache Flink®
Kostas Tzoumas - Stream Processing with Apache Flink®Kostas Tzoumas - Stream Processing with Apache Flink®
Kostas Tzoumas - Stream Processing with Apache Flink®
Ververica
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
Kostas Tzoumas
 
Big Data et eGovernment
Big Data et eGovernmentBig Data et eGovernment
Big Data et eGovernment
eGov Innovation Center
 
Converged IT and Data Commons
Converged IT and Data CommonsConverged IT and Data Commons
Converged IT and Data Commons
Simon Twigger
 
What is the future of data strategy?
What is the future of data strategy?What is the future of data strategy?
What is the future of data strategy?
Denodo
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
Denodo
 
Tag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh PlatformTag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh Platform
Sanjay Padhi, Ph.D
 
A Successful Data Strategy for Insurers in Volatile Times (ASEAN)
A Successful Data Strategy for Insurers in Volatile Times (ASEAN)A Successful Data Strategy for Insurers in Volatile Times (ASEAN)
A Successful Data Strategy for Insurers in Volatile Times (ASEAN)
Denodo
 
DataSpryng Overview
DataSpryng OverviewDataSpryng Overview
DataSpryng Overview
jkvr
 
Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016
Stavros Kontopoulos
 
Driving Business Value Through Agile Data Assets
Driving Business Value Through Agile Data AssetsDriving Business Value Through Agile Data Assets
Driving Business Value Through Agile Data Assets
Embarcadero Technologies
 

Similar to Building a Distributed Collaborative Data Pipeline with Apache Spark (20)

Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
Future of Data Strategy
Future of Data StrategyFuture of Data Strategy
Future of Data Strategy
 
Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
Burton - Security, Privacy and Trust
Burton - Security, Privacy and TrustBurton - Security, Privacy and Trust
Burton - Security, Privacy and Trust
 
The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!
 
Kostas Tzoumas - Stream Processing with Apache Flink®
Kostas Tzoumas - Stream Processing with Apache Flink®Kostas Tzoumas - Stream Processing with Apache Flink®
Kostas Tzoumas - Stream Processing with Apache Flink®
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
Big Data et eGovernment
Big Data et eGovernmentBig Data et eGovernment
Big Data et eGovernment
 
Converged IT and Data Commons
Converged IT and Data CommonsConverged IT and Data Commons
Converged IT and Data Commons
 
What is the future of data strategy?
What is the future of data strategy?What is the future of data strategy?
What is the future of data strategy?
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
 
Tag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh PlatformTag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh Platform
 
A Successful Data Strategy for Insurers in Volatile Times (ASEAN)
A Successful Data Strategy for Insurers in Volatile Times (ASEAN)A Successful Data Strategy for Insurers in Volatile Times (ASEAN)
A Successful Data Strategy for Insurers in Volatile Times (ASEAN)
 
DataSpryng Overview
DataSpryng OverviewDataSpryng Overview
DataSpryng Overview
 
Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016
 
Driving Business Value Through Agile Data Assets
Driving Business Value Through Agile Data AssetsDriving Business Value Through Agile Data Assets
Driving Business Value Through Agile Data Assets
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1

Building a Distributed Collaborative Data Pipeline with Apache Spark

  • 1. Building a Distributed Collaborative Data Pipeline with Apache Spark
    Sergii Mikhtoniuk, Founder @ Kamu Data Inc.
  • 2. Example: Hydroxychloroquine Study
    Published in "The Lancet" journal in May
    Data:
    ⬝ > 96,000 COVID-19 patients
    ⬝ 671 hospitals worldwide
    ⬝ Proprietary database: Surgisphere
    Finding: Increased risk of in-hospital mortality
    Global trials are halted completely
    * https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)31180-6/fulltext
  • 3. Example: Hydroxychloroquine Study (cont.)
    Publication retracted in June
    ⬝ After numerous data inconsistencies were pointed out
    ⬝ Data provenance could not be established
    ⬝ Surgisphere database goes offline
    A data management issue
    ⬝ Was ignored for years
    ⬝ Derailed life-saving efforts at a critical time
    Just one of many examples…
  • 4. Reproducibility Crisis
    Reproducibility is the foundation of the scientific method
    Nearly impossible to achieve in practice
    Problem originates in our data management practices
    * https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970
  • 6. Two Kinds of Data
    Source Data
    ⬝ Generated by some system, or a result of observations
    ⬝ Cannot be reconstructed if lost
    ⬝ Publishers have full authority and are responsible for validity
    Derivative Data
    ⬝ Created from other data via some transformations (aggregates, summaries, ML etc.)
    ⬝ Can be reconstructed if lost
    ⬝ Validity depends on source data
  • 7. Source Data: Expectation
    Two unrelated parties at different times should be able to access the same data and validate that it comes unaltered from the trusted source
    ⬝ Stable References
    ⬝ Validation Mechanism
  • 8. Source Data: Reality
    Non-Temporal Datasets
    ⬝ Store "current state" of the domain
    ⬝ Destructive in-place updates
    ⬝ Constant loss of history
    Temporal Datasets
    ⬝ Store events / observations
    ⬝ In-place updates
    ⬝ Same URL - different data
  • 9. Derivative Data: Expectation
    Repeating all transformation steps produces the same results as the original.
    Having data, any party can verify that it was produced by declared transformations without being accidentally or maliciously altered.
    ⬝ Transparency
    ⬝ Determinism
  • 10. Derivative Data: Reality
    Reproducibility is a very manual process (workflows, compliance)
    Requires:
    ⬝ Stable references to source data
    ⬝ Reproducible environments
    ⬝ Code, libraries, transitive dependencies, OS, hardware
    ⬝ Self-contained projects
    ⬝ No external API calls
    ⬝ Eliminating randomness
    When time is pressing - it goes out of the window
  • 11. Examples
    Sharing data on Kaggle or GitHub
    ⬝ Goal: Collaborating on data cleaning
    ⬝ Reality: Cannot be trusted
    Data hubs & portals
    ⬝ Goal: Improve discoverability & break down silos
    ⬝ Reality: Validity cannot be established
    Copy + Version approach
    ⬝ Works only within the enterprise "bubble"
    ⬝ Doesn’t work between independent parties
  • 12. Summary
    We are mismanaging data on the most fundamental level
    Collaboration on data is impossible - forward progress is constantly lost
    Reached the limit of workflows - need a technical solution
  • 13. Is There a Better Way?
    We set out to design a new kind of data supply chain, built from the ground up for:
    ⬝ Reproducibility & verifiability
    ⬝ Low latency
    ⬝ Complete provenance
    ⬝ Data reuse and collaboration
  • 15. Introducing Open Data Fabric
    ODF is a protocol specification for reliable data exchange and transformation
    The world’s first P2P data pipeline!
  • 16. Strict Definition of "Data"
    There is no such thing as "current state" - only history
    ⬝ Data is temporal
    History doesn’t change
    ⬝ Data is immutable
    Future decision making relies on history
    ⬝ Data must have infinite retention
    Time is relative, information propagates with a delay
    ⬝ Use bitemporality to reflect that
  • 17. Data Model
    Dataset - a potentially infinite stream of events / observations
    Every event has
    ⬝ Event time - when it occurred in the outside world
    ⬝ System time - when it was first observed by the system (monotonic)
    Stable Ref = (dataset_id, system_time)
    Validity = (stable_ref, hash)
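The data model on slide 17 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Open Data Fabric format: the `Event` class, the `snapshot`, `content_hash` and `validity` helpers, the JSON serialization and the choice of SHA-256 are all hypothetical. It shows why a stable reference of the form (dataset_id, system_time) keeps pointing at the same data even after late observations are appended.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    event_time: datetime   # when it occurred in the outside world
    system_time: datetime  # when the system first observed it
    payload: dict


def snapshot(events, system_time):
    """Return the events that were visible at the given system time."""
    return [e for e in events if e.system_time <= system_time]


def content_hash(events):
    """Hash events in a canonical order (hypothetical serialization)."""
    digest = hashlib.sha256()
    for e in sorted(events, key=lambda e: (e.system_time, e.event_time)):
        record = {
            "event_time": e.event_time.isoformat(),
            "system_time": e.system_time.isoformat(),
            "payload": e.payload,
        }
        digest.update(json.dumps(record, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def validity(dataset_id, system_time, events):
    """Pair a stable reference with the hash of the data it points to."""
    stable_ref = (dataset_id, system_time)
    return (stable_ref, content_hash(snapshot(events, system_time)))


t1, t2 = datetime(2020, 6, 1), datetime(2020, 6, 2)
events = [Event(datetime(2020, 5, 30), t1, {"cases": 10})]
before = validity("example.dataset", t1, events)

# A late-arriving observation is appended; nothing is updated in place.
events.append(Event(datetime(2020, 5, 29), t2, {"cases": 7}))
after = validity("example.dataset", t1, events)
```

Because the late event carries a later system time, `before` and `after` are equal: two parties resolving the same stable reference at different times see the same data and the same hash.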
  • 20. Temporal Metadata
    Metadata is a series of events
    ⬝ Stores everything that has ever influenced the data
    Key enabler of:
    ⬝ Reproducibility
    ⬝ Verifiability
    ⬝ Dataset evolution
    ⬝ Efficient data sharing
    ⬝ Fine-grained provenance (WIP)
  • 21. Metadata Chain
    Blockchain-like
    Cryptographically secured
    Linked to data
    Extensible
    ⬝ Semantics / Ontology
    ⬝ Governance / Licensing
    ⬝ Privacy / Security
    ⬝ …
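The "blockchain-like, cryptographically secured, linked to data" idea from slide 21 can be illustrated with a plain hash chain. The sketch below is an assumption-laden toy, not the ODF metadata format: the block fields (`prev_hash`, `system_time`, `event`, `data_hash`) and the helper names are hypothetical. Each block stores the hash of its predecessor and the hash of the data it describes, so changing any past block invalidates the chain.

```python
import hashlib
import json


def block_hash(block):
    """Hash the block body; the stored 'hash' field itself is excluded."""
    body = {k: block[k] for k in ("prev_hash", "system_time", "event", "data_hash")}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_block(chain, system_time, event, data_hash):
    """Append a metadata block linked to the previous block and to the data."""
    block = {
        "prev_hash": chain[-1]["hash"] if chain else None,
        "system_time": system_time,
        "event": event,
        "data_hash": data_hash,
    }
    block["hash"] = block_hash(block)
    chain.append(block)
    return block


def verify_chain(chain):
    """Check that every block is intact and points at its predecessor."""
    prev = None
    for block in chain:
        if block["prev_hash"] != prev or block["hash"] != block_hash(block):
            return False
        prev = block["hash"]
    return True


chain = []
append_block(chain, "2020-06-01T00:00:00Z", {"kind": "add_data"}, "hash-of-slice-1")
append_block(chain, "2020-06-02T00:00:00Z", {"kind": "set_transform"}, None)
append_block(chain, "2020-06-03T00:00:00Z", {"kind": "add_data"}, "hash-of-slice-2")
intact = verify_chain(chain)

# Rewriting history in an earlier block is detected.
chain[0]["event"]["kind"] = "tampered"
tampered = verify_chain(chain)
```

Here `intact` is `True` and `tampered` is `False`. Extensions such as licensing or ontology information would simply be further event kinds in the same chain.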
  • 23. Goodbye Batch
    Batch processing is unfit for purpose!
    ⬝ Relies on simplicity of non-temporal data
    ⬝ Ignores temporal problems
    ⬝ Late / out-of-order arrivals
    ⬝ Backfills & corrections
    ⬝ Arrival cadence differences in joins
    Using batch would be
    ⬝ Extremely error-prone
    ⬝ Impossible to audit
  • 24. Hello Stream Processing!
    Is designed for temporal problems
    ⬝ Event-time processing (bitemporality)
    ⬝ Watermarks, Triggers
    ⬝ Windowing
    ⬝ Projections
    ⬝ Stream-to-Stream Joins
    ⬝ Projection (Temporal-Table) Joins
    Useful only for near real-time processing?
    ⬝ A common misconception!
  • 25. Stream Processing as the Primary Transformation Method
    Pros
    ⬝ Agnostic of how and how often data arrives
    ⬝ Declarative and expressive - easier to audit
    ⬝ Better for determinism and reproducibility
    ⬝ Minimal latency
    Cons
    ⬝ Unfamiliarity
    ⬝ Limited framework support

        SELECT
          TUMBLE_ROWTIME(order_time, INTERVAL '1' DAY),
          order_id,
          count(*) as num_shipments,
          sum(shipped_quantity) as shipped_total
        FROM order_shipments
        GROUP BY TUMBLE(order_time, INTERVAL '1' DAY), order_id
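To make the query on slide 25 concrete, here is a small Python sketch of the same grouping logic: events are assigned to one-day tumbling windows by their event time (`order_time`) and aggregated per `order_id`. It is a simplified, in-memory illustration with hypothetical helper names (`tumble_start`, `daily_shipments`), not a streaming engine: there are no watermarks or triggers, and it labels each window by its start, whereas `TUMBLE_ROWTIME` in the SQL reports a timestamp at the end of the window.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def tumble_start(ts, size):
    """Start of the fixed-size, non-overlapping window that contains ts."""
    return EPOCH + ((ts - EPOCH) // size) * size


def daily_shipments(rows, size=timedelta(days=1)):
    """Aggregate (order_time, order_id, shipped_quantity) rows per window."""
    totals = defaultdict(lambda: [0, 0])
    for order_time, order_id, shipped_quantity in rows:
        key = (tumble_start(order_time, size), order_id)
        totals[key][0] += 1                 # num_shipments
        totals[key][1] += shipped_quantity  # shipped_total
    return {key: tuple(value) for key, value in totals.items()}


rows = [
    (datetime(2020, 1, 1, 9), 1, 5),
    (datetime(2020, 1, 1, 17), 1, 3),
    (datetime(2020, 1, 2, 8), 1, 2),
    (datetime(2020, 1, 1, 12), 2, 7),
]
in_order = daily_shipments(rows)
out_of_order = daily_shipments(list(reversed(rows)))
```

Because windows are keyed by event time rather than arrival time, `in_order` and `out_of_order` are identical, which is the "agnostic of how and how often data arrives" property listed among the pros.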
  • 26. Execution Environment
    ODF is framework-agnostic
    Strict versioning + Sandboxing ensures reproducibility
  • 27. Data Sharing Implications
    Data storage
    ⬝ Root - durable and highly available (e.g. S3, GS, HDFS)
    ⬝ Derivative - cheap, unreliable, or none at all
    ⬝ A form of caching
    ⬝ Can be fully reconstructed from metadata and root data
    Metadata
    ⬝ A digital passport of data
    ⬝ Used for verifying data integrity
    ⬝ Validating work of another peer to spot malicious activity
    ⬝ Exchanged securely between peers
  • 29. kamu-cli
    ODF coordinator
    Single-binary app
    ⬝ Written in Rust!
    Two prototype engines:
    ⬝ Apache Spark
    ⬝ Apache Flink
    Convenience features:
    ⬝ SQL Shell
    ⬝ Notebooks
    Alpha-quality but actively developed
  • 34. ODF Summary
    Data pipeline designed around properties essential for collaboration
    ⬝ Addresses decades of stagnation in data management
    A much stricter model
    ⬝ No API calls, no "black box" transformations
    Requires a mindset shift
    ⬝ Redefining what data is and how we treat it
    ⬝ Embracing temporal problems
  • 35. The Next Level of Data Democratization
    One of the pillars of the Digital Democracy and the next-generation decentralized IT
    Supply of factual data for blockchain contracts
    ⬝ With no central authority
    ⬝ Resistant to malicious behavior
    Foundation for data monetization
    ⬝ Publisher incentives
    Data provider for next generation web & apps
  • 36. References
    kamu-cli: https://github.com/kamu-data/kamu-cli/
    Open Data Fabric: http://opendatafabric.org/
    Blog: https://www.kamu.dev/blog/
    Email: smikhtoniuk@kamu.dev
  • 37. Call for Action
    Data is our modern age history book - it should be treated as such
    ⬝ Stop modifying and copying data
    ⬝ Don’t version data - use bitemporal modelling instead
    Data publishers have to take ownership of reproducibility
    ⬝ Need good standards and tools
    Non-temporal data is a local optimum
    ⬝ Temporal data should be the default choice
    Stream processing will displace batch
    ⬝ Let’s collaborate on improving tools, not oversimplifying problems!