TileDB webinars
Debunking “purpose-built data systems”:
Enter the Universal Database
Dr. Stavros Papadopoulos, Founder & CEO of TileDB, Inc.
Who is this webinar for?
You are building data(base) systems
You are using data(base) systems
… but you swim in a sea of different data tools and file formats
You want to store, analyze and share diverse data at scale ...
At data(base) companies or in-house
At a large enterprise team, scientific organization or independently
Disclaimer
We are not delusional; we know that what we are proposing is audacious
I am the exclusive recipient of complaints
Email me at: stavros@tiledb.com
All the credit for our amazing work goes to our powerful team
Check it out at https://tiledb.com/about
Deep roots at the intersection of HPC, databases and data science
Traction with telecoms, pharmas, hospitals and other scientific organizations
35 employees, with domain experts across all of our application domains
Who we are
TileDB was spun out of MIT and Intel Labs in 2017
WHERE IT ALL STARTED
Raised over $20M, we are very well capitalized
INVESTORS
The Problem with “Purpose-Built Data Systems”
The Definition of the Universal Database
What the Database Community Missed
Architectural Decisions in TileDB
Proving Feasibility: Some Use Cases
The Future of Data Management
Agenda
The problem with purpose-built data systems
And then there was light
We built a lot of sophistication around relational databases
Relational algebra + SQL
Roles, constraints, etc.
Row stores vs. column stores
OLTP (transactional) vs. OLAP (warehouses)
Shared-nothing architectures
… and a lot more
Relational databases worked beautifully for tabular data
Big Data & the Sciences
In the meantime, the sciences have been generating immense amounts of data
This data is too big for traditional database architectures
This data cannot be represented very well with tables
Other database flavors (e.g., document) are not ideal either
Not all scientists like “databases” :)
Genomics
Imaging (satellite / biomedical)
LiDAR / SONAR / AIS
Weather
… and many more
The Cloud effect
As storage and compute needs got out of hand, the cloud grew in popularity
Separation of storage and compute was inevitable
Cloud object stores were the obvious (cheapest) choice
Old database architectures did not work off the shelf
New paradigm: “lake houses”
Store all data as (pretty much) flat files on cloud stores
Use a “hammer” (computational framework) to scale compute
Treat data management as an afterthought and adopt “hacks”
The Machine Learning hype
Everybody wants to jump on the next sexy thing
Many great new frameworks and tools emerged around ML
People started to enjoy coding and building great new things
ML facilitated the advent of “Data Science”
Everyone thought that ML is a “compute” problem
In reality, ML is a data management problem
But there was an important mistake
And then there was mess
Too many file formats and disparate files lying around in cloud buckets
A metadata hell gave rise to “metadata systems”
Data sharing became complex and gave rise to “governance systems”
ML gave rise to numerous “feature / model stores”
Thousands of “data” / “ML” companies and open-source tools got created
Cloud vendors keep on pitching you hundreds of tools with funny names
“Data management” (and ML) became the noisiest problem space!
The Problem in a nutshell
Organizations lose time and money
Science is being slowed down
Organizations working on important problems are lost in the noise
They use a combination of numerous data systems that are difficult to orchestrate
Or they build their own in-house solutions
There is a ton of reinventing the wheel along the way
Huge engineering overlap across domains and solutions
Scientists spend most of their time as data engineers
The definition of the Universal Database
What makes a database Universal
A single system with efficient support for
all types of data (including ML)
all types of metadata and catalogs
Authentication, access control, security, logging
all types of computations (SQL, ML, Linear Algebra, custom)
Global sharing, collaboration and monetization
all storage backends and computer hardware
all language APIs and tool integrations
“Infinite” computational scale
Benefits of the Universal Database
Future-proofing
Don’t build a new system, extend your existing one
No. More. Noise.
Single data platform to solve problems with diverse data and analyses
Single data platform for authentication, access control and auditing
Easy, global-scale collaboration on data and (runnable) code
Superb extensibility via APIs and other internal abstractions
Modularity and API standardization
Facilitates user creativity, preventing reinvention of the wheel
What we missed as a database community
Why no one had built it
We are stuck in an echo chamber
All cloud vendor marketing campaigns are around purpose-built systems
Some purpose-built systems had success
Universality intuitively seems like a long shot (and a LOT of work)
Tons of funding is currently being poured into incremental data solutions
The most promising data structure got overlooked!
Other solutions used it without traction
Arrays were never used to their fullest potential
This structure is the multi-dimensional array
How arrays were used
Each cell is uniquely identified by integral coordinates
Every cell has a value
This is called a dense array
Pretty good for storing images, video, …
… but not good for tables and many more!
A multi-dimensional object comprised of cells
Pros of dense arrays
Dense array engines provide fast ND slicing
Dense arrays do not materialize coordinates
Slicing is zero-copy and close to optimal
In a table, we’d need to store the coordinates explicitly
Then apply a SQL WHERE condition on the coordinates
This wastes space and makes the query too slow
Dense array engine vs. tabular format + SQL
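To make the contrast concrete, here is a minimal sketch (plain NumPy and pandas, not TileDB code) of the same 2D slice done on a dense array versus on a table where the coordinates must be materialized as columns:

```python
import numpy as np
import pandas as pd

# Dense array: coordinates are implicit in the layout, no storage overhead.
dense = np.arange(100 * 100, dtype=np.float64).reshape(100, 100)
block = dense[10:20, 30:40]          # ND slice, a zero-copy view into the buffer

# Table: every cell must materialize its (i, j) coordinates as columns.
i, j = np.meshgrid(np.arange(100), np.arange(100), indexing="ij")
table = pd.DataFrame({"i": i.ravel(), "j": j.ravel(), "value": dense.ravel()})
block_t = table[(table.i >= 10) & (table.i < 20) &
                (table.j >= 30) & (table.j < 40)]   # scan + filter, like a SQL WHERE

assert block.size == len(block_t)    # same cells, very different cost and footprint
```

The dense slice is a constant-time view into contiguous memory, whereas the table needs roughly three times the storage and a full scan (or an index) to answer the same query.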
Cons of dense arrays
No string dimensions
No real (floating-point) dimensions
No heterogeneous dimensions
No cell multiplicities
Hence, definitely no table support
No efficient sparsity support
A dense array database is not enough
Too limited if it can’t store tables and other sparse data
Remember, not everyone likes “databases” (the way they are perceived today)
Scientists opted for array storage engines and custom formats / libraries
Therefore, they missed out on the other important DB features
Full circle, back to the mess
The Lost Secret Sauce | The Data Model
Heterogeneous dimensions (plus strings and reals)
Cell multiplicities
Arbitrary metadata
Dense array
In addition to dense arrays
Native support for sparse arrays
Sparse array
The Lost Secret Sauce | The Data Model
Arrays give you a flexible way to lay out the data on a 1D medium
Arrays also allow you to chunk (or tile) your data and compress, encrypt, etc.
Arrays provide rapid slicing from the 1D space, preserving the ND locality
End result:
An efficient, unified storage abstraction
We can now build APIs, access control, versioning, a compute framework, etc. on top of it
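A minimal sketch of how this looks with the open-source tiledb-py package; the array name, tile sizes, and compression filter here are arbitrary illustration choices, and the exact API should be checked against the TileDB docs:

```python
import numpy as np
import tiledb

# Dense 2D array: 100x100 cells, laid out in 10x10 tiles and compressed with Zstd.
dom = tiledb.Domain(
    tiledb.Dim(name="rows", domain=(0, 99), tile=10, dtype=np.int32),
    tiledb.Dim(name="cols", domain=(0, 99), tile=10, dtype=np.int32),
)
attr = tiledb.Attr(name="value", dtype=np.float64,
                   filters=tiledb.FilterList([tiledb.ZstdFilter(level=5)]))
schema = tiledb.ArraySchema(domain=dom, attrs=[attr], sparse=False)
tiledb.Array.create("dense_example", schema)

with tiledb.open("dense_example", mode="w") as A:
    A[:] = np.random.rand(100, 100)

with tiledb.open("dense_example", mode="r") as A:
    block = A[10:20, 30:40]["value"]   # fast ND slice, preserving locality
```

The 10x10 tiling is the unit of I/O and compression, which is what preserves ND locality when the array is laid out on a 1D medium such as an object store.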
The Lost Secret Sauce | The Data Model
Sparse array
Dense vector
Dataframe
Arrays subsume dataframes
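To back the “arrays subsume dataframes” claim with something runnable, a short sketch assuming the tiledb.from_pandas ingestion helper and the .df indexer in tiledb-py (treat both as assumptions and check the current docs):

```python
import pandas as pd
import tiledb

df = pd.DataFrame({
    "sample":   ["s1", "s2", "s3"],
    "position": [101, 205, 310],
    "value":    [0.1, 0.7, 0.3],
})

# Ingest the dataframe into a TileDB array; columns map to dimensions/attributes.
tiledb.from_pandas("df_example", df)

# Read it back as a dataframe (or slice it like any other array).
with tiledb.open("df_example") as A:
    print(A.df[:])
```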
The Lost Secret Sauce | The Data Model
What else can be modeled as an array?
LiDAR (3D sparse)
SAR (2D or 3D dense)
Population genomics (3D sparse)
Single-cell genomics (2D dense or sparse)
Biomedical imaging (2D or 3D dense)
Even flat files!!! (1D dense)
Time series (ND dense or sparse)
Weather (2D or 3D dense)
Graphs (2D sparse)
Video (3D dense)
Key-values (1D or ND sparse)
The Lost Secret Sauce | The Compute Model
Take any algorithm from any application domain and remove the jargon
This algorithm can be split into one or more steps (the tasks)
Each task typically operates on a slice of data
Some tasks can run in parallel, some cannot (due to dependencies)
We can build a single task graph engine for any arbitrary compute
This task graph engine should be part of the database, with an exposed API (see the sketch below)
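The sketch below is not TileDB's task graph engine; it is a generic, standard-library-only illustration of the idea: each task names its dependencies, independent tasks run in parallel on a thread pool, and dependent tasks wait for their parents' results.

```python
from concurrent.futures import ThreadPoolExecutor

def run_task_graph(tasks, deps):
    """tasks: name -> callable(*parent_results); deps: name -> list of parent task names."""
    futures = {}
    with ThreadPoolExecutor() as pool:
        def submit(name):
            if name not in futures:
                parents = [submit(p) for p in deps.get(name, [])]
                futures[name] = pool.submit(
                    lambda n=name, ps=parents: tasks[n](*[p.result() for p in ps]))
            return futures[name]
        for name in tasks:
            submit(name)
        # Note: a real engine would schedule by topological order instead of blocking workers.
        return {name: f.result() for name, f in futures.items()}

# Example: two independent slices are processed in parallel, then merged.
results = run_task_graph(
    tasks={"slice_a": lambda: sum(range(5)),
           "slice_b": lambda: sum(range(5, 10)),
           "merge":   lambda a, b: a + b},
    deps={"merge": ["slice_a", "slice_b"]},
)
print(results["merge"])  # 45
```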
The Lost Secret Sauce | Extensibility
Data slicing and each task should be executable from any language
A good bet is to build as much as possible in C
Although 90% of the data management code is the same across applications, each scientist has their own favorite language and tool
APIs should still be written in the application's jargon
Should support multiple backends and computer hardware (existing and new)
There is no chance that the database can offer all operations built-in
Operations should be crowdsourced and shared
The database should provide the infrastructure to facilitate that
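As a concrete illustration of shipping user code to the data, here is a hedged sketch using the TileDB Cloud Python client; tiledb.cloud.udf.exec follows the client's documented usage, but treat the exact names as assumptions, and the token is a placeholder.

```python
import tiledb.cloud

# Placeholder token; in practice this comes from your TileDB Cloud account.
tiledb.cloud.login(token="MY_TILEDB_CLOUD_TOKEN")

# Any Python function can be shipped and executed server-side, next to the data.
def median_of_slice(values):
    import numpy as np   # imported inside the function so it resolves server-side
    return float(np.median(values))

result = tiledb.cloud.udf.exec(median_of_slice, [1, 5, 2, 9, 3])
print(result)  # 3.0
```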
The Lost Secret Sauce | Summary
The foundation for a universal database:
array data model + generic task graphs + extreme extensibility
Architectural decisions we took in TileDB
How we built a Universal Database
TileDB Cloud: unified data management and easy serverless compute at global scale
❏ Access control and logging
❏ Serverless SQL, UDFs, task graphs
❏ Jupyter notebooks and dashboards
Pluggable Compute: Efficient APIs & Tool Integrations
TileDB Embedded: open-source interoperable storage with a universal open-spec array format
❏ Parallel IO, rapid reads & writes
❏ Columnar, cloud-optimized
❏ Data versioning & time traveling
TileDB Embedded
Universal storage based on ND arrays
Any data (.las, .cog, .vcf, .csv), any tool, any backend
TileDB Embedded
Superior performance: built in C++, fully parallelized, columnar format, multiple compressors, R-trees for sparse arrays
Rapid updates & data versioning: immutable writes, lock-free, parallel reader / writer model, time traveling
Open source: https://github.com/TileDB-Inc/TileDB
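A short sketch of the immutable-write and time-traveling behavior with tiledb-py; the timestamp argument to tiledb.open is the part worth double-checking against the current docs:

```python
import time
import numpy as np
import tiledb

uri = "versioned_example"
schema = tiledb.ArraySchema(
    domain=tiledb.Domain(tiledb.Dim(name="i", domain=(0, 9), tile=10, dtype=np.int32)),
    attrs=[tiledb.Attr(name="v", dtype=np.int32)],
)
tiledb.Array.create(uri, schema)

with tiledb.open(uri, mode="w") as A:   # first write, stored as an immutable fragment
    A[:] = np.zeros(10, dtype=np.int32)
t_first = int(time.time() * 1000)       # milliseconds since the epoch
time.sleep(0.1)
with tiledb.open(uri, mode="w") as A:   # second write, a new fragment (no in-place update)
    A[:] = np.ones(10, dtype=np.int32)

with tiledb.open(uri, mode="r", timestamp=t_first) as A:
    print(A[:]["v"])                    # time travel: only the first write is visible
```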
TileDB Embedded
Extreme interoperability: numerous APIs, numerous integrations, all backends
Optimized for the cloud: immutable writes, parallel IO, minimization of requests
Open source: https://github.com/TileDB-Inc/TileDB
TileDB Cloud
Management. Collaboration. Scalability.
Universal data (.las, .cog, .vcf, .csv), universal storage, universal tooling, universal scale
TileDB Cloud
Built to work anywhere: works as SaaS (https://cloud.tiledb.com), works on premises, currently on AWS, soon on any cloud
It is completely serverless: slicing, SQL, UDFs, task graphs
It is geo-aware: on-demand JupyterHub instances, launchable Jupyter notebooks, compute sent to the data
It is secure: authentication, compliance, etc.
TileDB Cloud
Everything is monetizable: full marketplace (via Stripe)
Everything is shareable at global scale: access control inside and outside your organization, make any data and code public, discover any public data and code (central catalog)
Everything is an array! Jupyter notebooks, UDFs and task graphs, ML models, dashboards (e.g., R Shiny apps), all types of data (even flat files)
Everything is logged: full auditability (data, code, any action)
Proving feasibility: some use cases
Anything tabular
Cloud-optimized storage
Fast multi-column slicing
Integrations with MariaDB, Presto/Trino, Spark, pandas
Serverless SQL on TileDB Cloud
Flexibility in building and sharing distributed SQL
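For example, a hedged sketch of serverless SQL through the TileDB Cloud Python client; tiledb.cloud.sql.exec is the documented entry point, while the token and the array name tiledb://MyOrg/my_table are placeholders:

```python
import tiledb.cloud
from tiledb.cloud import sql

tiledb.cloud.login(token="MY_TILEDB_CLOUD_TOKEN")  # placeholder token

# The query runs server-side against a registered array and returns a pandas DataFrame.
df = sql.exec("SELECT AVG(value) AS avg_value FROM `tiledb://MyOrg/my_table`")
print(df)
```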
Anything ML
Model management (storage and sharing)
Integration with TensorFlow, PyTorch and more
Flexibility in building arbitrary pipelines
Native data management (access control, logging)
Scalability for training and serving models
Anything Geospatial
Point cloud (LiDAR, SONAR, AIS)
SAR (temporal stacks)
Raster
Hyperspectral imaging
Weather
All serverless
Extreme scalability
Tool integrations
Data and code sharing & monetization
Genomics
Population genomics
Single-cell genomics
All serverless
Extreme scalability
Tool integrations
Collaboration and reproducibility
Performance at a low cost
Marketplaces
Monetize any data & any code
No data duplication and movement
In-platform analytics
No infrastructure management
Flexible pay-as-you-go model
Communities
Share your work, learn from others, promote science
A massive catalog of analysis-ready datasets
A massive catalog of runnable code
Collaboration and reproducibility
The future of Data Management
Prediction
If universal databases are proven to work, they will subsume warehouses and lake houses
Universal databases:
support tables and SQL
work on the cloud
convert all file formats to arrays
offer scalable compute
support custom user code
support anything ML
How I view the future
The future of data management is you!
Stop building a new system for every single “twist”
Build different components within some universal database
Eventually, even stop using the term “universal”
All databases must be universal by default
Focus energy on Science, not unnecessary Engineering
Build a massive collaborative data community
Enable brilliance
The Universal Database
Thank you
WE ARE HIRING
Apply at tiledb.workable.com
