Today's data economics is flawed. There is a need for a fundamental change in the way we produce, distribute and consume data. This presentation describes a solution with TileDB that can shape the future of data management.
Debunking "Purpose-Built Data Systems:": Enter the Universal DatabaseStavros Papadopoulos
Purpose-built databases and platforms have actually created more complexity, effort, and unnecessary reinvention. The status quo is a big mess. TileDB took the opposite approach.
In this presentation, Stavros, the original creator of TileDB, shared the underlying principles of the TileDB universal database built on multi-dimensional arrays, making the case for it as a true first in the data management industry.
AIS data management and time series analytics on TileDB Cloud (Webinar, Feb 3...) - Stavros Papadopoulos
Slides used in the webinar TileDB hosted with participation from Spire Maritime, describing the use and accessibility of massive time series maritime data on TileDB Cloud.
All about Big Data components and the best tools to ingest, process, store and visualize the data.
This is a keynote from the series "by Developer for Developers" powered by eSolutionsGrup.
Big Data Streams Architectures. Why? What? How? - Anton Nazaruk
With the current zoo of technologies and the different ways they interact, it is a big challenge to architect a system (or adapt an existing one) that meets low-latency big data analysis requirements. Apache Kafka and the Kappa Architecture in particular are drawing more and more attention away from the classic Hadoop-centric technology stack. The new Consumer API gave a significant boost in this direction. Microservices-based stream processing and the new Kafka Streams are proving to be a synergy in the big data world.
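For a taste of the consumer side of such a pipeline, here is a minimal sketch using the kafka-python client (a stand-in for the Java Consumer API the talk references); the broker address and topic name are placeholders:

```python
# Minimal Kafka consumer sketch with kafka-python (a stand-in for the Java
# Consumer API); broker address and topic are hypothetical placeholders.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",                            # hypothetical topic
    bootstrap_servers="localhost:9092",  # hypothetical broker
    auto_offset_reset="earliest",
    value_deserializer=lambda b: b.decode("utf-8"),
)
for message in consumer:
    # Each record carries partition/offset metadata alongside the payload.
    print(message.partition, message.offset, message.value)
```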
Introduction to Big Data Hadoop Training Online by www.itjobzone.biz - ITJobZone.biz
Want to learn Hadoop online? This presentation gives you an introduction to Big Data Hadoop training online by expert trainers at ITJobZone.biz. Start your Hadoop online training with this presentation.
Introduction to Big Data Technologies & Applications - Nguyen Cao
Big Data Myths, Current Mainstream Technologies related to Collecting, Storing, Computing & Stream Processing Data. Real-life experience with E-commerce businesses.
Handling Electronic Health Records Logs with Hadoop - StampedeCon 2015 - StampedeCon
At the StampedeCon 2015 Big Data Conference: Electronic Health Records systems are required to keep an audit trail of everyone accessing any patient information. What results is similar to the click-stream of a website. This audit data is required for regulatory compliance, but can also be useful in understanding behavior patterns and how processes actually get done. Mercy has more than 20TB of this access log data and has recently moved it from a specialized column-store database into Hadoop, using HBase for storage and Apache Phoenix for JDBC / SQL access. Users access the standard access log reports through SAP Business Objects.
This session will cover various aspects of how this large volume of data (35 billion rows) was migrated from legacy systems and is being maintained in HBase, as well as how connectivity through Phoenix and Business Objects has been established.
An overview of Hadoop and data warehouses from technology and business viewpoints. The presentation also includes some of my personal observations and suggestions for people who want to join the Big Data field.
On Friday, September 25th, Devin Hopps led us through a presentation on an introduction to Big Data and how technology has evolved to harness its power.
IDERA Live | The Ever Growing Science of Database Migrations - IDERA Software
You can watch the replay for this webcast in the IDERA Resource Center: http://ow.ly/QHaG50A58ZB
Many information technology professionals may not recognize it, but the bulk of their work has been and continues to be nothing more than database migrations. In the old days it was sharing files across systems, then moving files into relational databases, then loading data warehouses, and now we're moving to NoSQL and the cloud. In this presentation we'll delve into the ever growing and increasingly complex world of database migrations. Some of the considerations include what issues must be planned for and overcome, what problems are likely to occur, and what types of tools exist.
Database expert Bert Scalzo will cover these and many other database migration concerns.
About Bert: Bert Scalzo is an Oracle ACE, author, speaker, consultant, and a major contributor to many popular database tools used by millions of people worldwide. He has 30+ years of database experience and has worked for several major database vendors. He has a BS, MS, and PhD in computer science, plus an MBA. He has presented at numerous events and webcasts. His areas of key interest include data modeling, database benchmarking, database tuning, SQL optimization, "star schema" data warehousing, running databases on Linux or VMware, and using NVMe flash-based technology to speed up database performance.
D.3.1: State of the Art - Linked Data and Digital Preservation - PRELIDA Project
By D. Giaretta (APARSEN), presented at the 3rd PRELIDA Consolidation and Dissemination Workshop, Riva, Italy, October 17, 2014. More information about the workshop at: prelida.eu
Analyzing LiDAR and SAR data with Capella Space and TileDB (TileDB webinars, ...) - Stavros Papadopoulos
Slides by Stavros Papadopoulos (TileDB) and Jason Brown (Capella Space) from the joint TileDB-Capella Space webinar held in April 2022 on SAR and LiDAR data analytics.
Solving the Really Big Tech Problems with IoT - Eric Kavanagh
The Briefing Room with Dr. Robin Bloor and HPE Security
The Internet of Things brings new technological problems: sensor communications are bi-directional, the scale of data generation points has no precedent and, in this new world, security, privacy and data protection need to go out to the edge. Likely, most of that data lands in Hadoop and Big Data platforms. With the need for rapid analytics never greater, companies try to seize opportunities in tighter time windows. Yet cyber-threats are at an all-time high, targeting the most valuable of assets: the data.
Register for this episode of The Briefing Room to hear Analyst Dr. Robin Bloor explain the implications of today's divergent data forces. He'll be briefed by Reiner Kappenberger of HPE, who will discuss how a recent innovation, NiFi, is revolutionizing the big data ecosystem. He'll explain how this technology dramatically simplifies data flow design, enabling a new era of business-driven analysis, while also protecting sensitive data.
Managing The Data Deluge By Optimizing Storage - Dell World
IDC predicts the overall big data and analytics market will hit $125 billion in 2015 as organizations increasingly seek to gain insight and competitive advantage from their ever-increasing volumes of data. Learn how Dell's broad portfolio of flexible, scalable and cost-effective storage solutions with cutting-edge flash, intelligent data placement, and software-defined technologies deliver a more agile and efficient data infrastructure to better achieve these goals.
Data Virtualization to Survive a Multi and Hybrid Cloud World - Denodo
Watch the full webinar here: https://buff.ly/2Edqlpo
Hybrid cloud computing is slowly becoming the standard for businesses. The transition to hybrid can be challenging depending on the environment and the needs of the business; a successful move involves using the right technology and seeking the right help. At the same time, multi-cloud strategies are on the rise. More enterprise organizations than ever before are analyzing their current technology portfolio and defining a cloud strategy that encompasses multiple cloud platforms to suit specific app workloads, and to move those workloads as they see fit.
In this session, you will learn:
*Key challenges of migration to the cloud in a complex data landscape
*How data virtualization can help build a data-driven, multi-location cloud architecture for real-time integration
*How customers are taking advantage of data virtualization to save time and costs with limited resources
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Talking about the ease of using and handling Big Data technologies in the cloud, with Google Cloud Platform, Amazon Web Services and all of the tools around them.
Showing the problems and how we can solve them with simple tools.
Webinar: NAS vs. Object Storage: 10 Reasons Why Object Storage Will Win - Storage Switzerland
Join Storage Switzerland's Founder George Crump and Caringo's VP of Products Tony Barbagallo for this on demand webinar. They compare NAS and object storage and provide 10 reasons they think object storage will be the file server of the future for the enterprise.
Eliminating the Problems of Exponential Data Growth, Forever - Spectra Logic
Balancing explosive data growth while addressing the need for extended data protection is mandatory for any IT department. But customers today find it difficult to address these challenges because of the software management layers and tools required to meet longer retention mandates. While exponential data growth is not a new problem, the quandary that IT faces in 2014 now has a new solution.
Join Spectra and IDC as we identify the greatest dilemmas facing data centers in 2014, and explore the capabilities of Spectra’s newest product, the BlackPearl™ Deep Storage Appliance. During this brief webinar, attendees will learn about:
-A situation analysis of today’s software-defined data center
-How moving to an “elastic” data center enables more cost-effective and efficient data management
-Emerging technologies and key strategies to store and manage data indefinitely
Webinar: End NAS Sprawl - Gain Control Over Unstructured Data - Storage Switzerland
The key to ending NAS Sprawl is to fix the file system so it can offer cost effective, scalable, high performance storage. In this webinar Storage Switzerland Lead Analyst George Crump, Quantum VP of Global Marketing Molly Rector, and the Quantum StorNext Solution Marketing Senior Director Dave Frederick discuss the challenges facing the typical scale-out storage environment and what IT professionals should be looking for in solutions to eliminate NAS Sprawl once and for all.
Workshop, December 9, 2015, LBS College of Engineering
www.sarithdivakar.info | www.csegyan.org
http://sarithdivakar.info/2015/12/09/wordcount-program-in-python-using-apache-spark-for-data-stored-in-hadoop-hdfs/
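The linked post walks through the classic word-count example; a minimal PySpark sketch of it, with placeholder HDFS paths, might look like this:

```python
# Word count over a file in HDFS with the PySpark RDD API -- a sketch of the
# classic example the linked post covers; both HDFS paths are placeholders.
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")
counts = (sc.textFile("hdfs:///user/demo/input.txt")
            .flatMap(lambda line: line.split())     # split lines into words
            .map(lambda word: (word, 1))            # pair each word with 1
            .reduceByKey(lambda a, b: a + b))       # sum counts per word
counts.saveAsTextFile("hdfs:///user/demo/wordcount-out")
sc.stop()
```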
Webinar: Cloud Storage: The 5 Reasons IT Can Do it Better - Storage Switzerland
In this webinar learn the five reasons why a private cloud storage system may be more cost effective and deliver a higher quality of service than public cloud storage providers.
In this webinar you will learn:
1. What Public Cloud Storage Architectures Look Like
2. Why Public Providers Chose These Architectures
3. The Problem With Traditional Data Center File Solutions
4. Bringing Cloud Lessons to Traditional IT
5. The Five Reasons IT can Do it Better
How to Radically Simplify Your Business Data Management - Clusterpoint
Relational databases were designed around a tabular data storage model. That requires complex software: schemas, encoded data, inflexible relations, sophisticated indexes. The complexity of your IT systems increases many-fold over your database's lifetime, and so do your costs. Yet we have a solution for this.
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha... - DATAVERSITY
Thirty years is a long time for a technology foundation to be as active as relational databases. Are their replacements here? In this webinar, we say no.
Databases have not sat around while Hadoop emerged. The Hadoop era generated a ton of interest and confusion, but is it still relevant as organizations are deploying cloud storage like a kid in a candy store? We’ll discuss what platforms to use for what data. This is a critical decision that can dictate two to five times additional work effort if it’s a bad fit.
Drop the herd mentality. In reality, there is no “one size fits all” right now. We need to make our platform decisions amidst this backdrop.
This webinar will distinguish these analytic deployment options and help you platform 2020 and beyond for success.
Similar to "Population genomics is a data management problem" (20)
The Building Blocks of QuestDB, a Time Series Database - Javier Ramirez
Talk delivered at the Valencia Codes Meetup, June 2024.
Traditionally, databases have treated timestamps as just another data type. However, when performing real-time analytics, timestamps should be first-class citizens, and we need rich time semantics to get the most out of our data. We also need to deal with ever-growing datasets while staying performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open-source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is ...
Multiply with different modes (map):
1. Performance of sequential vs OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce):
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce):
1. Performance of sequential vs OpenMP-based vector element sum.
2. Performance of memcpy vs in-place CUDA-based vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce):
1. Comparing various launch configs for CUDA-based vector element sum (in-place).
Quantitative Data Analysis: Reliability Analysis (Cronbach Alpha), Common Method... - 2023240532
Quantitative Data Analysis
Overview
Reliability Analysis (Cronbach Alpha)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group ("MCG") expects demand to grow and the supply landscape to evolve, facilitated by institutional investment rotating out of offices and into work-from-home ("WFH") arrangements, while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as advancing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps reduce duplicate computations and thus iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
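To make one of these ideas concrete, here is a minimal Python sketch (not the STICD implementation) of the first optimization, skipping rank updates for vertices that have already converged; the graph representation, damping factor and tolerance are assumptions:

```python
# PageRank with per-vertex convergence skipping -- a sketch, not STICD.
# graph: dict mapping each vertex to its list of out-neighbours.
# Note: freezing "converged" vertices is a heuristic; their in-neighbours
# may still move, so this trades a little accuracy for iteration time.
def pagerank_skip_converged(graph, damping=0.85, tol=1e-8, max_iter=100):
    n = len(graph)
    ranks = {v: 1.0 / n for v in graph}
    in_nbrs = {v: [] for v in graph}          # precompute in-links once
    for u, outs in graph.items():
        for v in outs:
            in_nbrs[v].append(u)
    converged = set()
    for _ in range(max_iter):
        active = False
        for v in graph:
            if v in converged:
                continue                       # skip already-converged work
            new_rank = (1 - damping) / n + damping * sum(
                ranks[u] / len(graph[u]) for u in in_nbrs[v]
            )
            if abs(new_rank - ranks[v]) < tol:
                converged.add(v)
            else:
                active = True
            ranks[v] = new_rank
        if not active:                         # everything has converged
            break
    return ranks

# Example: a tiny 4-vertex graph (dangling nodes omitted for simplicity).
g = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank_skip_converged(g))
```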
1. TileDB webinars - Nov 4, 2021
Population Genomics is a Data Management Problem
Dr. Stavros Papadopoulos, Founder & CEO of TileDB, Inc.
2. Who we are
Deep roots at the intersection of HPC, databases and data science
Traction with telecoms, pharmas, hospitals and other scientific organizations
40 members with expertise across all applications and domains
WHERE IT ALL STARTED: TileDB was spun out from MIT and Intel Labs in 2017; originally developed GenomicsDB (a collaboration between Intel and Broad)
INVESTORS: raised over $20M; we are very well capitalized
3. Disclaimer
I am the exclusive recipient of complaints
Email me at: stavros@tiledb.com
All the credit for our amazing work goes to our powerful team
Check it out at https://tiledb.com/about
4. Special Thanks
Their mission is to empower every person to improve their life through DNA
Multi-year collaboration to scale population genomics workflows
Provided tons of ideas and optimizations to TileDB-VCF
Find more info about Helix at helix.com
5. Agenda
The problem in population genomics
A general solution
A concrete solution with TileDB
Dr. Stephen Kingsmore on genome-informed inpatient pediatric care
TileDB-VCF walkthrough (Dr. Aaron Wolen)
Work in progress
6. The Problem | On the Surface
A large collection of (single-sample) VCF files
Analysis is done by “slicing” a portion across files
Mainly two approaches:
● Separate single-sample VCFs
● Combined multi-sample VCFs
Specialized downstream tools, expecting VCF inputs
All solutions revolve around files and environments
7. The Problem | On the Surface
Problem with single-sample VCFs:
Latency from slicing each file separately adds up
Problems with multi-sample VCFs:
Storage space scales super-linearly with the number of samples
A multi-sample VCF file cannot be updated (the N+1 problem)
Scaling population genomics is blocked
This is just a tiny fraction of the problem in Genomics
8. The Holistic Problem
Data management is nowhere in the picture
The whole data economics in genomics is flawed
9. Data Economics
Production: what format does the data get produced in, and where does it get stored
Distribution: who has access to the data, what is the means of access, and monetization
Consumption: how tools can compute on the data, and where the computation happens
10. The Production Problem
Numerous VCF files, also some tables
-> Storage in some cloud bucket or file manager
-> Specialized applications, wrangling and fusion
-> Some analytics infra
Verdict: slow & expensive, costly & time consuming, often custom & in-house
11. The Distribution Problem #1
Numerous VCF files, also some tables -> storage in some cloud bucket or marketplace
Org #1: download + wrangle + build analytics infra
...
Org #N: download + wrangle + build analytics infra
Verdict: wasteful re-invention
12. The Distribution Problem #2
Numerous VCF files, also some tables -> wrangling, etc. -> some analytics infra
Serves queries by consumer #1 ... queries by consumer #N
Verdict: the data owner bears the distribution cost, with re-invention across data owners
13. The Consumption Problem
Numerous VCF files, also some tables -> storage in some cloud bucket or server
Group #1: wrangle + copy, then use tool & infra #1
...
Group #N: wrangle + copy, then use tool & infra #N
Verdict: inefficient & costly, poor governance
14. The Solution
Data in a universal, analysis-ready format, on a universal data management platform
User / group #1 ... #N: any tool, any scale
All data science tools
No infrastructure hassles
No downloads or copies
Efficient and cloud-native
Solves the N+1 problem
Unifies all data
Accessible by any tool
Global-scale governance
One infra, you own the data
Collaboration and reproducibility
Marketplace built-in
Cost shifted to consumer
15. Enter TileDB
A universal data management platform built on multi-dimensional arrays
Data in a universal, analysis-ready format
User / group #1 ... #N: any tool, any scale
Secure governance & collaboration
Scalable, serverless compute
Data & code sharing & monetization
Pay-as-you-go, consumer pays
Extreme interoperability
Zero infrastructure
16. The Secret Sauce | The Data Model
Store everything as dense or sparse multi-dimensional arrays
(Illustrated: a dense array and a sparse array)
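To make the model concrete, here is a minimal sketch with the TileDB-Py API that creates one dense and one sparse 2D array with a single attribute; the array URIs are local placeholders:

```python
# Dense vs sparse multi-dimensional arrays in TileDB-Py (a sketch; the
# "dense_demo" / "sparse_demo" URIs are placeholder local paths).
import numpy as np
import tiledb

dom = tiledb.Domain(
    tiledb.Dim(name="rows", domain=(0, 3), tile=2, dtype=np.int32),
    tiledb.Dim(name="cols", domain=(0, 3), tile=2, dtype=np.int32),
)
attr = tiledb.Attr(name="a", dtype=np.float64)

# Dense array: every cell in the 4x4 domain is materialized.
tiledb.Array.create("dense_demo", tiledb.ArraySchema(domain=dom, sparse=False, attrs=[attr]))
with tiledb.open("dense_demo", "w") as A:
    A[:] = np.arange(16, dtype=np.float64).reshape(4, 4)

# Sparse array: only the cells you write exist; coordinates are explicit.
tiledb.Array.create("sparse_demo", tiledb.ArraySchema(domain=dom, sparse=True, attrs=[attr]))
with tiledb.open("sparse_demo", "w") as A:
    A[[0, 2], [1, 3]] = np.array([10.0, 20.0])   # cells (0,1) and (2,3)
```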
17. Population Genomics with TileDB
Store variant call data as 3D sparse arrays
Storage (diagram): a 3D sparse array with dimensions Contig (string), Position (uint32) and Sample (string); cells hold SNPs and indels/CNVs (v1 ... v7), with anchor cells (a1) inserted every anchor_gap positions so that long variants are caught by query range expansion
Retrieval (diagram): slicing the expanded query range returns the intersecting variant and anchor cells
https://github.com/TileDB-Inc/TileDB-VCF
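A hedged sketch of slicing such a dataset with the TileDB-VCF Python bindings; the dataset URI, attribute selection, region and sample names below are placeholders:

```python
# Slice variant records across samples with TileDB-VCF (a sketch; the
# dataset URI, region and sample names are hypothetical).
import tiledbvcf

ds = tiledbvcf.Dataset("s3://my-bucket/vcf-dataset", mode="r")
df = ds.read(
    attrs=["sample_name", "contig", "pos_start", "alleles"],
    regions=["chr1:1000000-2000000"],
    samples=["sampleA", "sampleB"],
)
print(df.head())   # results arrive as a pandas DataFrame
```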
19. The Secret Sauce | The Data Model
What can be modeled as an array
LiDAR (3D sparse)
SAR (2D or 3D dense)
Population genomics (3D sparse)
Single-cell genomics (2D dense or sparse)
Biomedical imaging (2D or 3D dense)
Even flat files! (1D dense)
Time series (ND dense or sparse)
Weather (2D or 3D dense)
Graphs (2D sparse)
Video (3D dense)
Key-values (1D or ND sparse)
Tables (1D dense or ND sparse)
20. How we built a Universal Database
TileDB Embedded: open-source interoperable storage with a universal open-spec array format
❏ Parallel IO, rapid reads & writes
❏ Columnar, cloud-optimized
❏ Data versioning & time traveling
TileDB Cloud: unified data management and easy serverless compute at global scale
❏ Access control and logging
❏ Serverless SQL, UDFs, task graphs
❏ Jupyter notebooks and dashboards
Connected through efficient APIs & tool integrations, with zero-copy techniques
21. TileDB Embedded
Open source: https://github.com/TileDB-Inc/TileDB
Superior performance:
Built in C++
Fully parallelized
Columnar format
Multiple compressors
R-trees for sparse arrays
Rapid updates & data versioning:
Immutable writes
Lock-free
Parallel reader / writer model
Time traveling
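A minimal sketch of immutable writes plus time traveling with TileDB-Py; the array URI and the explicit timestamps are placeholders:

```python
# Immutable writes and time traveling in TileDB-Py (a sketch; the URI and
# timestamps are placeholders).
import numpy as np
import tiledb

uri = "versioned_demo"
dom = tiledb.Domain(tiledb.Dim(name="i", domain=(0, 3), tile=4, dtype=np.int32))
schema = tiledb.ArraySchema(domain=dom, sparse=False,
                            attrs=[tiledb.Attr(name="a", dtype=np.float64)])
tiledb.Array.create(uri, schema)

# Each write lands as an immutable fragment stamped with a timestamp.
with tiledb.open(uri, "w", timestamp=1) as A:
    A[:] = np.zeros(4)
with tiledb.open(uri, "w", timestamp=2) as A:
    A[:] = np.ones(4)

with tiledb.open(uri, timestamp=1) as A:   # read the array as of timestamp 1
    print(A[:]["a"])                       # [0. 0. 0. 0.]
with tiledb.open(uri) as A:                # read the latest version
    print(A[:]["a"])                       # [1. 1. 1. 1.]
```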
24. TileDB Cloud
Built to work anywhere: works as SaaS (https://cloud.tiledb.com), works on premises; currently on AWS, soon on any cloud
It is completely serverless: slicing, SQL, UDFs, task graphs
Can launch Jupyter notebooks: on-demand JupyterHub instances
It is geo-aware: compute is sent to the data
It is secure: authentication, compliance, etc.
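A hedged sketch of the serverless path with the tiledb-cloud Python client, shipping a user-defined function to run next to the data; the API token is a placeholder:

```python
# Serverless UDF execution on TileDB Cloud (a sketch; the token is a
# placeholder and a TileDB Cloud account is assumed).
import tiledb.cloud

tiledb.cloud.login(token="MY_API_TOKEN")

def median_of(values):
    import numpy as np
    return float(np.median(values))

# The function and its arguments are shipped to TileDB Cloud and executed
# server-side; only the result comes back.
result = tiledb.cloud.udf.exec(median_of, [1, 2, 3, 4, 5])
print(result)  # 3.0
```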
25. TileDB Cloud
Everything is monetizable: full marketplace (via Stripe)
Everything is shareable at global scale: access control inside and outside your organization; make any data and code public; discover any public data and code (central catalog)
Everything is an array: all types of data (even flat files), Jupyter notebooks, UDFs and task graphs, ML models, dashboards (e.g., R Shiny apps)
Everything is logged: full auditability (data, code, any action)
26. Work in Progress
Or why you should depart from file formats and specialized solutions
RLE compression for strings
Compute directly on compressed data
Compute push-down
Fine-grained access policies
Constant perf optimizations