SlideShare a Scribd company logo
1 of 38
Download to read offline
Building a Distributed Collaborative
Data Pipeline with Apache Spark
Sergii Mikhtoniuk
Founder @ Kamu Data Inc.
2
Example: Hydroxychloroquine Study
Published in "The Lancet" journal in May
Data:
⬝ > 96,000 COVID-19 patients
⬝ 671 hospitals worldwide
⬝ Proprietary database: Surgisphere
Finding: Increased risk of in-hospital mortality
Global trials are halted completely
* https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)31180-6/fulltext
3
Example: Hydroxychloroquine Study (cont.)
Publication retracted in June
⬝ After numerous data inconsistencies pointed out
⬝ Data provenance could not be established
⬝ Surgisphere database goes offline
A data management issue
⬝ Was ignored for years
⬝ Derailed life-saving efforts in critical time
Just one of many examples…
4
Reproducibility Crisis
Reproducibility is the foundation of the
scientific method
Nearly-impossible to achieve in practice
Problem originates in our data
management practices
* https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970
Reproducibility & Verifiability
in Modern Data
Two Kinds of Data
Derivative DataSource Data
6
Generated by some system
⬝ Or a result of observations
Cannot be reconstructed if lost
Publishers
⬝ Have full authority
⬝ Responsible for validity
Created from other data via some
transformations
⬝ Aggregates, summaries, ML etc.
Can be reconstructed if lost
Validity depends on source data
7
Source Data: Expectation
Two unrelated parties at different times should be able to access the same data
and validate that it comes unaltered from the trusted source
Stable References Validation Mechanism
Source Data: Reality
Temporal DatasetsNon-Temporal Datasets
8
Store "current state" of the
domain
Destructive in-place updates
Constant loss of history
Store events / observations
In-place updates
Same URL - different data
9
Derivative Data: Expectation
Repeating all transformation steps produces same results as the original.
Having data, any party can verify that it was produced by declared transformations
without being accidentally or maliciously altered.
Transparency Determinism
10
Derivative Data: Reality
Reproducibility is a very manual process (workflows, compliance)
Requires:
⬝ Stable references to source data
⬝ Reproducible environments
⬝ Code, libraries, transitive dependencies, OS, hardware
⬝ Self-contained projects
⬝ No external API calls
⬝ Eliminating randomness
When time is pressing - it goes out of the window
11
Examples
Sharing data on Kaggle or Github
⬝ Goal: Collaborating on data cleaning
⬝ Reality: Cannot be trusted
Data hubs & portals
⬝ Goal: improve discoverability & break down silos
⬝ Reality: Validity cannot be established
Copy + Version approach
⬝ Works only within the enterprise "bubble"
⬝ Doesn’t work between independent parties
12
Summary
We are mismanaging data on the most fundamental level
Collaboration on data is impossible - forward progress is constantly lost
Reached the limit of workflows - need a technical solution
13
Is There a Better Way?
We set out to design a new kind of data supply chain designed from ground up for
Reproducibility & verifiability
Low latency
Complete provenance
Data reuse and collaboration
http://opendatafabric.org
ODF is a protocol specification for
reliable data exchange and
transformation
The world’s first P2P data pipeline!
15
Introducing Open Data Fabric
16
Strict Definition of "Data"
There is no such thing as "current state" - only history
⬝ Data is temporal
History doesn’t change
⬝ Data is immutable
Future decision making relies on history
⬝ Data must have infinite retention
Time is relative, information propagates with a delay
⬝ Use bitemporality to reflect that
17
Data Model
Dataset - a potentially infinite stream of events / observations
Every event has
⬝ Event time - when it occured in the outside world
⬝ System time - when it was first observed by the system (monotonic)
Stable Ref = (dataset_id, system_time)
Validity = (stable_ref, hash)
18
Data Flow
Dataset Manifests
Derivative DatasetRoot Dataset
19
20
Temporal Metadata
Metadata is a series of events
⬝ Stores everything that has
ever influenced the data
Key enabler of:
⬝ Reproducibility
⬝ Verifiability
⬝ Dataset evolution
⬝ Efficient data sharing
⬝ Fine-grained provenance (WIP)
21
Metadata Chain
Blockchain-like
Cryptographically secured
Linked to data
Extensible
⬝ Semantics / Ontology
⬝ Governance / Licensing
⬝ Privacy / Security
⬝ …
22
Dataset Layout
23
Batch processing is unfit for purpose!
⬝ Relies on simplicity of non-temporal data
⬝ Ignores temporal problems
⬝ Late / out-of-order arrivals
⬝ Backfills & corrections
⬝ Arrival cadence differences in joins
Using batch would be
⬝ Extremely error-prone
⬝ Impossible to audit
Goodbye Batch
24
Hello Stream Processing!
Is designed for temporal problems
⬝ Event-time processing (bitemporality)
⬝ Watermarks, Triggers
⬝ Windowing
⬝ Projections
⬝ Stream-to-Stream Joins
⬝ Projection (Temporal-Table) Joins
Useful only for near real-time processing?
⬝ A common misconception!
25
Stream Processing as the Primary Transformation Method
Cons
Agnostic of how and how often data arrives
Declarative and expressive - easier to audit
Better for determinism and reproducibility
Minimal latency
Unfamiliarity
Limited framework support
Pros
SELECT
TUMBLE_ROWTIME(order_time, INTERVAL '1' DAY),
order_id,
count(*) as num_shipments,
sum(shipped_quantity) as shipped_total
FROM order_shipments
GROUP BY TUMBLE(order_time, INTERVAL '1' DAY), order_id
26
Execution Environment
ODF is framework-agnostic
Strict versioning + Sandboxing ensures reproducibility
27
Data Sharing Implications
Data storage
⬝ Root - durable and highly available (e.g. S3, GS, HDFS)
⬝ Derivative - cheap, unreliable, or none at all
⬝ A form of caching
⬝ Can be fully reconstructed from metadata and root data
Metadata
⬝ A digital passport of data
⬝ Used for verifying data integrity
⬝ Validating work of another peer to spot malicious activity
⬝ Exchanged securely between peers
https://github.com/kamu-data/kamu-cli
kamu-cli
ODF coordinator
Single-binary app
⬝ Written in Rust!
Two prototype engines:
⬝ Apache Spark
⬝ Apache Flink
Convenience features:
⬝ SQL Shell
⬝ Notebooks
Alpha-quality but actively developed
29
kamu-cli
30
Git-like Interface
31
Spark SQL Shell
32
Notebooks
Conclusion
34
ODF Summary
Data pipeline designed around properties essential for collaboration
⬝ Addresses decades of stagnation in data management
A much stricter model
⬝ No API calls, no "black box" transformations
Requires a mindset shift
⬝ Redefining what data is and how we treat it
⬝ Embracing temporal problems
35
The Next Level of Data Democratization
One of the pillars of the Digital Democracy
and the next-generation decentralized IT
Supply of factual data for blockchain contracts
⬝ With no central authority
⬝ Resistant to malicious behavior
Foundation for data monetization
⬝ Publisher incentives
Data provider for next generation web & apps
36
References
kamu-cli: https://github.com/kamu-data/kamu-cli/
Open Data Fabric: http://opendatafabric.org/
Blog: https://www.kamu.dev/blog/
Email: smikhtoniuk@kamu.dev
37
Call for Action
Data is our modern age history book - it should be treated as such
⬝ Stop modifying and copying data
⬝ Don’t version data - use bitemporal modelling instead
Data publishers have to take ownership of reproducibility
⬝ Need good standards and tools
Non-temporal data is a local optima
⬝ Temporal data should be the default choice
Stream processing will displace batch
⬝ Let’s collaborate on improving tools not oversimplifying problems!
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.

More Related Content

What's hot

Domain Driven Data: Apache Kafka® and the Data Mesh
Domain Driven Data: Apache Kafka® and the Data MeshDomain Driven Data: Apache Kafka® and the Data Mesh
Domain Driven Data: Apache Kafka® and the Data Meshconfluent
 
Making the most of your Snowflake Investment
Making the most of your Snowflake InvestmentMaking the most of your Snowflake Investment
Making the most of your Snowflake InvestmentPaul Van Siclen
 
Self-Service Analytics with Guard Rails
Self-Service Analytics with Guard RailsSelf-Service Analytics with Guard Rails
Self-Service Analytics with Guard RailsDenodo
 
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...HostedbyConfluent
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Databricks
 
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...Denodo
 
Data Virtualization: The Agile Delivery Platform
Data Virtualization: The Agile Delivery PlatformData Virtualization: The Agile Delivery Platform
Data Virtualization: The Agile Delivery PlatformDenodo
 
Denodo DataFest 2016: Big Data Virtualization in the Cloud
Denodo DataFest 2016: Big Data Virtualization in the CloudDenodo DataFest 2016: Big Data Virtualization in the Cloud
Denodo DataFest 2016: Big Data Virtualization in the CloudDenodo
 
Fixing data science & Accelerating Artificial Super Intelligence Development
 Fixing data science & Accelerating Artificial Super Intelligence Development Fixing data science & Accelerating Artificial Super Intelligence Development
Fixing data science & Accelerating Artificial Super Intelligence DevelopmentManojKumarR41
 
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...SoftServe
 
Cloud Modernization with Data Virtualization
Cloud Modernization with Data VirtualizationCloud Modernization with Data Virtualization
Cloud Modernization with Data VirtualizationDenodo
 
Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...
Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...
Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...Denodo
 
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationMaximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationDenodo
 
Denodo DataFest 2017: Outpace Your Competition with Real-Time Responses
Denodo DataFest 2017: Outpace Your Competition with Real-Time ResponsesDenodo DataFest 2017: Outpace Your Competition with Real-Time Responses
Denodo DataFest 2017: Outpace Your Competition with Real-Time ResponsesDenodo
 
3 Reasons Data Virtualization Matters in Your Portfolio
3 Reasons Data Virtualization Matters in Your Portfolio3 Reasons Data Virtualization Matters in Your Portfolio
3 Reasons Data Virtualization Matters in Your PortfolioDenodo
 
Agile Data Management with Enterprise Data Fabric (ASEAN)
Agile Data Management with Enterprise Data Fabric (ASEAN)Agile Data Management with Enterprise Data Fabric (ASEAN)
Agile Data Management with Enterprise Data Fabric (ASEAN)Denodo
 
Embedding Insight through Prediction Driven Logistics
Embedding Insight through Prediction Driven LogisticsEmbedding Insight through Prediction Driven Logistics
Embedding Insight through Prediction Driven LogisticsDatabricks
 
Data Mesh @ Yelp - 2019
Data Mesh @ Yelp - 2019Data Mesh @ Yelp - 2019
Data Mesh @ Yelp - 2019Steven Moy
 
Data Virtualization: An Introduction
Data Virtualization: An IntroductionData Virtualization: An Introduction
Data Virtualization: An IntroductionDenodo
 
How OpenTable uses Big Data to impact growth by Raman Marya
How OpenTable uses Big Data to impact growth by Raman MaryaHow OpenTable uses Big Data to impact growth by Raman Marya
How OpenTable uses Big Data to impact growth by Raman MaryaData Con LA
 

What's hot (20)

Domain Driven Data: Apache Kafka® and the Data Mesh
Domain Driven Data: Apache Kafka® and the Data MeshDomain Driven Data: Apache Kafka® and the Data Mesh
Domain Driven Data: Apache Kafka® and the Data Mesh
 
Making the most of your Snowflake Investment
Making the most of your Snowflake InvestmentMaking the most of your Snowflake Investment
Making the most of your Snowflake Investment
 
Self-Service Analytics with Guard Rails
Self-Service Analytics with Guard RailsSelf-Service Analytics with Guard Rails
Self-Service Analytics with Guard Rails
 
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
 
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
 
Data Virtualization: The Agile Delivery Platform
Data Virtualization: The Agile Delivery PlatformData Virtualization: The Agile Delivery Platform
Data Virtualization: The Agile Delivery Platform
 
Denodo DataFest 2016: Big Data Virtualization in the Cloud
Denodo DataFest 2016: Big Data Virtualization in the CloudDenodo DataFest 2016: Big Data Virtualization in the Cloud
Denodo DataFest 2016: Big Data Virtualization in the Cloud
 
Fixing data science & Accelerating Artificial Super Intelligence Development
 Fixing data science & Accelerating Artificial Super Intelligence Development Fixing data science & Accelerating Artificial Super Intelligence Development
Fixing data science & Accelerating Artificial Super Intelligence Development
 
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...
 
Cloud Modernization with Data Virtualization
Cloud Modernization with Data VirtualizationCloud Modernization with Data Virtualization
Cloud Modernization with Data Virtualization
 
Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...
Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...
Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...
 
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationMaximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
 
Denodo DataFest 2017: Outpace Your Competition with Real-Time Responses
Denodo DataFest 2017: Outpace Your Competition with Real-Time ResponsesDenodo DataFest 2017: Outpace Your Competition with Real-Time Responses
Denodo DataFest 2017: Outpace Your Competition with Real-Time Responses
 
3 Reasons Data Virtualization Matters in Your Portfolio
3 Reasons Data Virtualization Matters in Your Portfolio3 Reasons Data Virtualization Matters in Your Portfolio
3 Reasons Data Virtualization Matters in Your Portfolio
 
Agile Data Management with Enterprise Data Fabric (ASEAN)
Agile Data Management with Enterprise Data Fabric (ASEAN)Agile Data Management with Enterprise Data Fabric (ASEAN)
Agile Data Management with Enterprise Data Fabric (ASEAN)
 
Embedding Insight through Prediction Driven Logistics
Embedding Insight through Prediction Driven LogisticsEmbedding Insight through Prediction Driven Logistics
Embedding Insight through Prediction Driven Logistics
 
Data Mesh @ Yelp - 2019
Data Mesh @ Yelp - 2019Data Mesh @ Yelp - 2019
Data Mesh @ Yelp - 2019
 
Data Virtualization: An Introduction
Data Virtualization: An IntroductionData Virtualization: An Introduction
Data Virtualization: An Introduction
 
How OpenTable uses Big Data to impact growth by Raman Marya
How OpenTable uses Big Data to impact growth by Raman MaryaHow OpenTable uses Big Data to impact growth by Raman Marya
How OpenTable uses Big Data to impact growth by Raman Marya
 

Similar to Building a Distributed Collaborative Data Pipeline with Apache Spark

Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
 
Future of Data Strategy
Future of Data StrategyFuture of Data Strategy
Future of Data StrategyDenodo
 
Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)Denodo
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentHostedbyConfluent
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!DataWorks Summit/Hadoop Summit
 
Kostas Tzoumas - Stream Processing with Apache Flink®
Kostas Tzoumas - Stream Processing with Apache Flink®Kostas Tzoumas - Stream Processing with Apache Flink®
Kostas Tzoumas - Stream Processing with Apache Flink®Ververica
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingKostas Tzoumas
 
Converged IT and Data Commons
Converged IT and Data CommonsConverged IT and Data Commons
Converged IT and Data CommonsSimon Twigger
 
What is the future of data strategy?
What is the future of data strategy?What is the future of data strategy?
What is the future of data strategy?Denodo
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...Denodo
 
Tag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh PlatformTag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh PlatformSanjay Padhi, Ph.D
 
A Successful Data Strategy for Insurers in Volatile Times (ASEAN)
A Successful Data Strategy for Insurers in Volatile Times (ASEAN)A Successful Data Strategy for Insurers in Volatile Times (ASEAN)
A Successful Data Strategy for Insurers in Volatile Times (ASEAN)Denodo
 
DataSpryng Overview
DataSpryng OverviewDataSpryng Overview
DataSpryng Overviewjkvr
 
Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Stavros Kontopoulos
 
Driving Business Value Through Agile Data Assets
Driving Business Value Through Agile Data AssetsDriving Business Value Through Agile Data Assets
Driving Business Value Through Agile Data AssetsEmbarcadero Technologies
 

Similar to Building a Distributed Collaborative Data Pipeline with Apache Spark (20)

Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
Future of Data Strategy
Future of Data StrategyFuture of Data Strategy
Future of Data Strategy
 
Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
Burton - Security, Privacy and Trust
Burton - Security, Privacy and TrustBurton - Security, Privacy and Trust
Burton - Security, Privacy and Trust
 
The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!
 
Kostas Tzoumas - Stream Processing with Apache Flink®
Kostas Tzoumas - Stream Processing with Apache Flink®Kostas Tzoumas - Stream Processing with Apache Flink®
Kostas Tzoumas - Stream Processing with Apache Flink®
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
Big Data et eGovernment
Big Data et eGovernmentBig Data et eGovernment
Big Data et eGovernment
 
Converged IT and Data Commons
Converged IT and Data CommonsConverged IT and Data Commons
Converged IT and Data Commons
 
What is the future of data strategy?
What is the future of data strategy?What is the future of data strategy?
What is the future of data strategy?
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
 
Tag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh PlatformTag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh Platform
 
A Successful Data Strategy for Insurers in Volatile Times (ASEAN)
A Successful Data Strategy for Insurers in Volatile Times (ASEAN)A Successful Data Strategy for Insurers in Volatile Times (ASEAN)
A Successful Data Strategy for Insurers in Volatile Times (ASEAN)
 
DataSpryng Overview
DataSpryng OverviewDataSpryng Overview
DataSpryng Overview
 
Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016
 
Driving Business Value Through Agile Data Assets
Driving Business Value Through Agile Data AssetsDriving Business Value Through Agile Data Assets
Driving Business Value Through Agile Data Assets
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 

Recently uploaded (20)

办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 

Building a Distributed Collaborative Data Pipeline with Apache Spark

  • 1. Building a Distributed Collaborative Data Pipeline with Apache Spark Sergii Mikhtoniuk Founder @ Kamu Data Inc.
  • 2. 2 Example: Hydroxychloroquine Study Published in "The Lancet" journal in May Data: ⬝ > 96,000 COVID-19 patients ⬝ 671 hospitals worldwide ⬝ Proprietary database: Surgisphere Finding: Increased risk of in-hospital mortality Global trials are halted completely * https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)31180-6/fulltext
  • 3. 3 Example: Hydroxychloroquine Study (cont.) Publication retracted in June ⬝ After numerous data inconsistencies pointed out ⬝ Data provenance could not be established ⬝ Surgisphere database goes offline A data management issue ⬝ Was ignored for years ⬝ Derailed life-saving efforts in critical time Just one of many examples…
  • 4. 4 Reproducibility Crisis Reproducibility is the foundation of the scientific method Nearly-impossible to achieve in practice Problem originates in our data management practices * https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970
  • 6. Two Kinds of Data Derivative DataSource Data 6 Generated by some system ⬝ Or a result of observations Cannot be reconstructed if lost Publishers ⬝ Have full authority ⬝ Responsible for validity Created from other data via some transformations ⬝ Aggregates, summaries, ML etc. Can be reconstructed if lost Validity depends on source data
  • 7. 7 Source Data: Expectation Two unrelated parties at different times should be able to access the same data and validate that it comes unaltered from the trusted source Stable References Validation Mechanism
  • 8. Source Data: Reality Temporal DatasetsNon-Temporal Datasets 8 Store "current state" of the domain Destructive in-place updates Constant loss of history Store events / observations In-place updates Same URL - different data
  • 9. 9 Derivative Data: Expectation Repeating all transformation steps produces same results as the original. Having data, any party can verify that it was produced by declared transformations without being accidentally or maliciously altered. Transparency Determinism
  • 10. 10 Derivative Data: Reality Reproducibility is a very manual process (workflows, compliance) Requires: ⬝ Stable references to source data ⬝ Reproducible environments ⬝ Code, libraries, transitive dependencies, OS, hardware ⬝ Self-contained projects ⬝ No external API calls ⬝ Eliminating randomness When time is pressing - it goes out of the window
  • 11. 11 Examples Sharing data on Kaggle or Github ⬝ Goal: Collaborating on data cleaning ⬝ Reality: Cannot be trusted Data hubs & portals ⬝ Goal: improve discoverability & break down silos ⬝ Reality: Validity cannot be established Copy + Version approach ⬝ Works only within the enterprise "bubble" ⬝ Doesn’t work between independent parties
  • 12. 12 Summary We are mismanaging data on the most fundamental level Collaboration on data is impossible - forward progress is constantly lost Reached the limit of workflows - need a technical solution
  • 13. 13 Is There a Better Way? We set out to design a new kind of data supply chain designed from ground up for Reproducibility & verifiability Low latency Complete provenance Data reuse and collaboration
  • 15. ODF is a protocol specification for reliable data exchange and transformation The world’s first P2P data pipeline! 15 Introducing Open Data Fabric
  • 16. 16 Strict Definition of "Data" There is no such thing as "current state" - only history ⬝ Data is temporal History doesn’t change ⬝ Data is immutable Future decision making relies on history ⬝ Data must have infinite retention Time is relative, information propagates with a delay ⬝ Use bitemporality to reflect that
  • 17. 17 Data Model Dataset - a potentially infinite stream of events / observations Every event has ⬝ Event time - when it occured in the outside world ⬝ System time - when it was first observed by the system (monotonic) Stable Ref = (dataset_id, system_time) Validity = (stable_ref, hash)
  • 20. 20 Temporal Metadata Metadata is a series of events ⬝ Stores everything that has ever influenced the data Key enabler of: ⬝ Reproducibility ⬝ Verifiability ⬝ Dataset evolution ⬝ Efficient data sharing ⬝ Fine-grained provenance (WIP)
  • 21. 21 Metadata Chain Blockchain-like Cryptographically secured Linked to data Extensible ⬝ Semantics / Ontology ⬝ Governance / Licensing ⬝ Privacy / Security ⬝ …
  • 23. 23 Batch processing is unfit for purpose! ⬝ Relies on simplicity of non-temporal data ⬝ Ignores temporal problems ⬝ Late / out-of-order arrivals ⬝ Backfills & corrections ⬝ Arrival cadence differences in joins Using batch would be ⬝ Extremely error-prone ⬝ Impossible to audit Goodbye Batch
  • 24. 24 Hello Stream Processing! Is designed for temporal problems ⬝ Event-time processing (bitemporality) ⬝ Watermarks, Triggers ⬝ Windowing ⬝ Projections ⬝ Stream-to-Stream Joins ⬝ Projection (Temporal-Table) Joins Useful only for near real-time processing? ⬝ A common misconception!
  • 25. 25 Stream Processing as the Primary Transformation Method Cons Agnostic of how and how often data arrives Declarative and expressive - easier to audit Better for determinism and reproducibility Minimal latency Unfamiliarity Limited framework support Pros SELECT TUMBLE_ROWTIME(order_time, INTERVAL '1' DAY), order_id, count(*) as num_shipments, sum(shipped_quantity) as shipped_total FROM order_shipments GROUP BY TUMBLE(order_time, INTERVAL '1' DAY), order_id
  • 26. 26 Execution Environment ODF is framework-agnostic Strict versioning + Sandboxing ensures reproducibility
  • 27. 27 Data Sharing Implications Data storage ⬝ Root - durable and highly available (e.g. S3, GS, HDFS) ⬝ Derivative - cheap, unreliable, or none at all ⬝ A form of caching ⬝ Can be fully reconstructed from metadata and root data Metadata ⬝ A digital passport of data ⬝ Used for verifying data integrity ⬝ Validating work of another peer to spot malicious activity ⬝ Exchanged securely between peers
  • 29. ODF coordinator Single-binary app ⬝ Written in Rust! Two prototype engines: ⬝ Apache Spark ⬝ Apache Flink Convenience features: ⬝ SQL Shell ⬝ Notebooks Alpha-quality but actively developed 29 kamu-cli
  • 34. 34 ODF Summary Data pipeline designed around properties essential for collaboration ⬝ Addresses decades of stagnation in data management A much stricter model ⬝ No API calls, no "black box" transformations Requires a mindset shift ⬝ Redefining what data is and how we treat it ⬝ Embracing temporal problems
  • 35. 35 The Next Level of Data Democratization One of the pillars of the Digital Democracy and the next-generation decentralized IT Supply of factual data for blockchain contracts ⬝ With no central authority ⬝ Resistant to malicious behavior Foundation for data monetization ⬝ Publisher incentives Data provider for next generation web & apps
  • 36. 36 References kamu-cli: https://github.com/kamu-data/kamu-cli/ Open Data Fabric: http://opendatafabric.org/ Blog: https://www.kamu.dev/blog/ Email: smikhtoniuk@kamu.dev
  • 37. 37 Call for Action Data is our modern age history book - it should be treated as such ⬝ Stop modifying and copying data ⬝ Don’t version data - use bitemporal modelling instead Data publishers have to take ownership of reproducibility ⬝ Need good standards and tools Non-temporal data is a local optima ⬝ Temporal data should be the default choice Stream processing will displace batch ⬝ Let’s collaborate on improving tools not oversimplifying problems!
  • 38. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.