SlideShare a Scribd company logo
1 of 44
Download to read offline
The Future Roadmap for the
Composable Data Stack
Wes McKinney
Data Council Austin 2024
⌵
⌵
Agenda
● My background
● Posit and Me
● Composable data systems: an overview
● Active growth areas in 2024
● Opportunities and Predictions
⌵
⌵
Me
Principal
Architect
General Partner
Co-founder,
Advisor
Ibis
Creator Co-creator
Creator Advisor,
Investor
Author
⌵
⌵
Voltron Data (2021 — )
● Unlocking the potential of GPU
accelerated analytics for large
scale workloads
● Enterprise support for Apache
Arrow
● Open source partnerships
(Meta, Snowflake, others)
● $115M raised, 130 headcount
and growing
⌵
⌵
⌵
⌵
Posit PBC
● Founded 2009, originally as RStudio
● 300-person, remote-first company
● Open source software for code-first data science and
technical communication
● Certified B corp, no plans to go public or be acquired
● Designing for long-term resiliency, creating a 100-year
company
⌵
⌵
How Posit Makes Money
● Making open source work in the enterprise
● Products for managing the data science lifecycle
○ Posit Workbench: Managed IDEs and Jupyter
Notebooks with enterprise-grade security & compliance
○ Posit Connect: Publish and share interactive data apps,
notebooks, and reports
○ Posit Package Manager: securely manage deploy open
source and proprietary Python and R packages
⌵
⌵
My History with Posit
● 2016: Hadley Wickham and I make Feather file format
● 2018: I partner with RStudio to create non-profit Ursa Labs
for Arrow development
● 2020: Ursa team spins out into startup, becomes Voltron
Data in 2021
● 2022: RStudio becomes Posit, embracing polyglot future
● 2023: I rejoin Posit as a principal architect
⌵
⌵
https://ursalabs.org/blog/announcing-ursalabs/
The reality is that Hadley and I think the “language wars” are stupid when the
real problem we are solving is human user interface design for data analysis.
… The programming languages are our medium for crafting accessible and
productive tools. It has long been a frustration of mine that it isn’t easier to
share code and systems between R and Python.
…
I found that we share a passion for the long-term vision of empowering data
scientists and building a positive relationship with the open source user community.
Critically, RStudio has avoided the “startup trap” and managed to build a sustainable
business while still investing the vast majority of its engineering resources in open
source development.
⌵
⌵
https://ursalabs.org/blog/announcing-ursalabs/
The reality is that Hadley and I think the “language wars” are stupid when the real
problem we are solving is human user interface design for data analysis. … The
programming languages are our medium for crafting accessible and productive tools.
It has long been a frustration of mine that it isn’t easier to share code and systems
between R and Python.
…
I found that we share a passion for the long-term vision of empowering data
scientists and building a positive relationship with the open source user
community. Critically, RStudio has avoided the “startup trap” and managed to
build a sustainable business while still investing the vast majority of its
engineering resources in open source development.
⌵
⌵
Creative Solutions for Funding OSS Work
Sponsors
⌵
⌵
VLDB 2023
⌵
⌵
What is a “Composable Data System”?
● Builds with open standards and protocols
● Designed around modularity, reuse, and interoperability
with other systems that use shared interfaces
● Resists “vertical integration”, builds in a virtual cycle with
relevant open source ecosystem projects
⌵
⌵
“We envision that by decomposing data management systems
into a more modular stack of reusable components, the
development of new engines can be streamlined, while reducing
maintenance costs and ultimately providing a more consistent
user experience. By clearly outlining APIs and encapsulating
responsibilities, data management software could more easily be
adapted, for example, to leverage novel devices and accelerators, as
the underlying hardware evolves. By relying on a modular stack that
reuses execution engine and language frontend, data systems code
could provide a more consistent experience and semantics to users,
from transactional to analytic systems, from stream processing to
machine learning workloads."
Pedrera et al. 2023
⌵
⌵
“We envision that by decomposing data management systems into a
more modular stack of reusable components, the development of new
engines can be streamlined, while reducing maintenance costs and
ultimately providing a more consistent user experience. By clearly
outlining APIs and encapsulating responsibilities, data
management software could more easily be adapted, for example,
to leverage novel devices and accelerators, as the underlying
hardware evolves. By relying on a modular stack that reuses
execution engine and language frontend, data systems code could
provide a more consistent experience and semantics to users, from
transactional to analytic systems, from stream processing to machine
learning workloads."
Pedrera et al. 2023
⌵
⌵
“We envision that by decomposing data management systems into a
more modular stack of reusable components, the development of new
engines can be streamlined, while reducing maintenance costs and
ultimately providing a more consistent user experience. By clearly
outlining APIs and encapsulating responsibilities, data management
software could more easily be adapted, for example, to leverage novel
devices and accelerators, as the underlying hardware evolves. By
relying on a modular stack that reuses execution engine and
language frontend, data systems code could provide a more
consistent experience and semantics to users, from transactional
to analytic systems, from stream processing to machine learning
workloads."
Pedrera et al. 2023
⌵
⌵
Why now?
● 1st-Gen Big Data / Hadoop Era: 2006 - 2013
○ MapReduce popularizes disaggregated storage + compute
● 2nd-Gen 2013 - 2021
○ Vendors shift from proprietary software to delivery of services
○ Open source standards emerge and are popularized
○ Rapid progress in storage, networking, computing performance
● 3rd-Gen 2021 - … ?
○ Open standards for composability become widely accepted
○ Next-gen components emerge, existing systems start retrofitting with
composable pieces
⌵
⌵
Why now?
“...We foresee that composability is soon to cause
another major disruption to how data management
systems are designed. We foresee that monolithic
systems will become obsolete, and give space to a
new composable era for data management.”
Pedrera et al. 2023
⌵
⌵
NYC R Conference 2015
https://www.youtube.com/watch?v=stlxbC7uIzM&t=264s
⌵
⌵
McSherry, Isard, Murray 2015
⌵
⌵
“The published work on big data systems… detail[s]
their systems’ impressive scalability, [but] few directly
evaluate their absolute performance against
reasonable benchmarks. To what degree are these
systems truly improving performance, as opposed to
parallelizing overheads that they themselves
introduce?”
- McSherry, Isard, Murray 2015
⌵
⌵
Glimpse of the Composable Landscape
Engines
Substrait
ADBC, Flight
Storage
Protocols
Query Interface
Optimization
…?
⌵
⌵
Missing pieces
● Distributed computing (Spark, Dask, Ray, “Serverless”)
● ETL modeling (dbt and others)
● Workflow / Infra Orchestration (Airflow, Dagster, Flyte, …)
● Development Environments
● Metadata Catalogues
⌵
⌵
2016: A Cross-Language Fast
In-Memory Data Format
⌵
⌵
https://www.slideshare.net/wesm/data-science-without-borders-jupytercon-2017
2017
⌵
⌵
2017
⌵
⌵
Databases Equipped for the Composable Era
● ADBC : Arrow DataBase Connectivity
● FlightSQL: An Arrow-native wire protocol for SQL systems
⌵
⌵
⌵
⌵
2018
⌵
⌵
Layers of the Composable Cake
⌵
⌵
Modular Execution Engines
● Apache DataFusion (Rust)
● DuckDB (C++)
● Velox (C++)
● Theseus (C++ / CUDA)
⌵
⌵
Connecting Execution with Frontend + Optimizer
● Intermediate Representation (IR) for Queries
Substrait
https://substrait.io/
⌵
⌵
Modular Acceleration Projects
● Prestissimo (Velox in Presto)
● Apache DataFusion Comet (DataFusion in Spark)
● Apache Gluten (incubating) (Velox in Spark)
⌵
⌵
A Multi-Engine Data Stack?
● Why be locked into one full-stack execution engine (like
Spark)?
● Cost/performance/latency varies greatly across workload
shapes and sizes
● You can do a lot with DuckDB, but actual big data does still
exist
HT to Julien Hurault ! https://juhache.substack.com/p/multi-engine-data-stack-v0
⌵
⌵
⌵
⌵
Barriers to a multi-engine data stack
● SQL dialects are non-portable
and feature a wide spectrum of
supported features
○ sqlglot is helping with this!
● Deciding which engine-to-use is
non-trivial
● Diverse orchestration and
infrastructure requirements
⌵
⌵
CIDR 2021
⌵
⌵
⌵
⌵
Enter the Bird
● An effort to harmonize the best of modern SQL with
Pythonic fluent data frames
● Fixing a long list of shortcomings in the pandas API
● Leverage the benefits of a modern PL when creating
complex analytical SQL queries
● Bringing portability (> 20 backends supported) to provide a
unified Python API for a multi-engine data stack
ibis-project.org
⌵
⌵
https://ibis-project.org/posts/bigquery-arrays/
⌵
⌵
Other Ibis Notes
● Created in 2015 at the same time as Arrow
● Strong focus on deep SQL support
● Facilitates dev to prod with few code changes
● Integration with many third party libraries (streamlit, altair,
others)
● Helps automating the drudgery of tedious multi-step DDL
workflows in databases
⌵
⌵
Why this matters https://voltrondata.com/theseus
⌵
⌵
A Future Multi-Engine Stack
● An execution engine tailored for different data scales
○ <= 1TB: DuckDB and Friends
○ 1 - 10TB: Spark, Dask, Ray, etc.
○ > 10TB: Hardware-accelerated Processing (e.g. Theseus)
● A portable language front end (Ibis, Malloy, PRQL, or a
standard transpilable SQL)
● Arrow-native API and wire transport
Closing Thoughts

More Related Content

What's hot

Integrating Big Data Technologies
Integrating Big Data TechnologiesIntegrating Big Data Technologies
Integrating Big Data Technologies
DATAVERSITY
 
Rapid Data Analytics @ Netflix
Rapid Data Analytics @ NetflixRapid Data Analytics @ Netflix
Rapid Data Analytics @ Netflix
Data Con LA
 

What's hot (20)

Knowledge Graphs and Graph Data Science: More Context, Better Predictions (Ne...
Knowledge Graphs and Graph Data Science: More Context, Better Predictions (Ne...Knowledge Graphs and Graph Data Science: More Context, Better Predictions (Ne...
Knowledge Graphs and Graph Data Science: More Context, Better Predictions (Ne...
 
Big Data Platform at Pinterest
Big Data Platform at PinterestBig Data Platform at Pinterest
Big Data Platform at Pinterest
 
Serverless Computing
Serverless ComputingServerless Computing
Serverless Computing
 
Tableau Conference 2018: Binging on Data - Enabling Analytics at Netflix
Tableau Conference 2018: Binging on Data - Enabling Analytics at NetflixTableau Conference 2018: Binging on Data - Enabling Analytics at Netflix
Tableau Conference 2018: Binging on Data - Enabling Analytics at Netflix
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
System design for recommendations and search
System design for recommendations and searchSystem design for recommendations and search
System design for recommendations and search
 
User behavior analytics
User behavior analyticsUser behavior analytics
User behavior analytics
 
Dimensional Modeling Basic Concept with Example
Dimensional Modeling Basic Concept with ExampleDimensional Modeling Basic Concept with Example
Dimensional Modeling Basic Concept with Example
 
Cassandra Data Model
Cassandra Data ModelCassandra Data Model
Cassandra Data Model
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Integrating Big Data Technologies
Integrating Big Data TechnologiesIntegrating Big Data Technologies
Integrating Big Data Technologies
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
 
Summary introduction to data engineering
Summary introduction to data engineeringSummary introduction to data engineering
Summary introduction to data engineering
 
Delivering Data Democratization in the Cloud with Snowflake
Delivering Data Democratization in the Cloud with SnowflakeDelivering Data Democratization in the Cloud with Snowflake
Delivering Data Democratization in the Cloud with Snowflake
 
Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com
 
AIRBNB DATA WAREHOUSE & GRAPH DATABASE
AIRBNB DATA WAREHOUSE & GRAPH DATABASEAIRBNB DATA WAREHOUSE & GRAPH DATABASE
AIRBNB DATA WAREHOUSE & GRAPH DATABASE
 
Rapid Data Analytics @ Netflix
Rapid Data Analytics @ NetflixRapid Data Analytics @ Netflix
Rapid Data Analytics @ Netflix
 
BIG DATA and USE CASES
BIG DATA and USE CASESBIG DATA and USE CASES
BIG DATA and USE CASES
 
Data Science At Zillow
Data Science At ZillowData Science At Zillow
Data Science At Zillow
 
From a hack to Data Mesh (Devoxx 2022)
From a hack to Data Mesh (Devoxx 2022)From a hack to Data Mesh (Devoxx 2022)
From a hack to Data Mesh (Devoxx 2022)
 

Similar to The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Council Austin 2024

Consulting Profile_Victor_Torres_2016-VVCS
Consulting Profile_Victor_Torres_2016-VVCSConsulting Profile_Victor_Torres_2016-VVCS
Consulting Profile_Victor_Torres_2016-VVCS
Victor M Torres
 
Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014
Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014
Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014
KDZ - Zentrum für Verwaltungsforschung
 

Similar to The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Council Austin 2024 (20)

The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
IRJET- A Workflow Management System for Scalable Data Mining on Clouds
IRJET- A Workflow Management System for Scalable Data Mining on CloudsIRJET- A Workflow Management System for Scalable Data Mining on Clouds
IRJET- A Workflow Management System for Scalable Data Mining on Clouds
 
Enabling Data centric Teams
Enabling Data centric TeamsEnabling Data centric Teams
Enabling Data centric Teams
 
Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...
Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...
Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...
 
Coding software and tools used for data science management - Phdassistance
Coding software and tools used for data science management - PhdassistanceCoding software and tools used for data science management - Phdassistance
Coding software and tools used for data science management - Phdassistance
 
2019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 42019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 4
 
Making advanced analytics accessible to more companies
Making advanced analytics accessible to more companiesMaking advanced analytics accessible to more companies
Making advanced analytics accessible to more companies
 
MSPInspire - Azure ML and Power BI
MSPInspire - Azure ML and Power BIMSPInspire - Azure ML and Power BI
MSPInspire - Azure ML and Power BI
 
Consulting Profile_Victor_Torres_2016-VVCS
Consulting Profile_Victor_Torres_2016-VVCSConsulting Profile_Victor_Torres_2016-VVCS
Consulting Profile_Victor_Torres_2016-VVCS
 
SKOS as the focal point of linked data strategies
SKOS as the focal point of linked data strategiesSKOS as the focal point of linked data strategies
SKOS as the focal point of linked data strategies
 
Abhishek jaiswal
Abhishek jaiswalAbhishek jaiswal
Abhishek jaiswal
 
Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
 
Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014
Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014
Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014
 
5 years of Dataverse evolution
5 years of Dataverse evolution 5 years of Dataverse evolution
5 years of Dataverse evolution
 
DDDP 2019 - Brown to Green
DDDP 2019  - Brown to GreenDDDP 2019  - Brown to Green
DDDP 2019 - Brown to Green
 
IRJET- Cost Effective Workflow Scheduling in Bigdata
IRJET-  	  Cost Effective Workflow Scheduling in BigdataIRJET-  	  Cost Effective Workflow Scheduling in Bigdata
IRJET- Cost Effective Workflow Scheduling in Bigdata
 
It Consulting & Services - Black Basil Technologies
It Consulting & Services  - Black Basil TechnologiesIt Consulting & Services  - Black Basil Technologies
It Consulting & Services - Black Basil Technologies
 
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceIntroduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
 

More from Wes McKinney

More from Wes McKinney (20)

Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
 

Recently uploaded

Recently uploaded (20)

AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty Secure
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - Questionnaire
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentation
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System Strategy
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
Your enemies use GenAI too - staying ahead of fraud with Neo4j
Your enemies use GenAI too - staying ahead of fraud with Neo4jYour enemies use GenAI too - staying ahead of fraud with Neo4j
Your enemies use GenAI too - staying ahead of fraud with Neo4j
 

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Council Austin 2024

  • 1. The Future Roadmap for the Composable Data Stack Wes McKinney Data Council Austin 2024
  • 2. ⌵ ⌵ Agenda ● My background ● Posit and Me ● Composable data systems: an overview ● Active growth areas in 2024 ● Opportunities and Predictions
  • 4. ⌵ ⌵ Voltron Data (2021 — ) ● Unlocking the potential of GPU accelerated analytics for large scale workloads ● Enterprise support for Apache Arrow ● Open source partnerships (Meta, Snowflake, others) ● $115M raised, 130 headcount and growing
  • 6. ⌵ ⌵ Posit PBC ● Founded 2009, originally as RStudio ● 300-person, remote-first company ● Open source software for code-first data science and technical communication ● Certified B corp, no plans to go public or be acquired ● Designing for long-term resiliency, creating a 100-year company
  • 7. ⌵ ⌵ How Posit Makes Money ● Making open source work in the enterprise ● Products for managing the data science lifecycle ○ Posit Workbench: Managed IDEs and Jupyter Notebooks with enterprise-grade security & compliance ○ Posit Connect: Publish and share interactive data apps, notebooks, and reports ○ Posit Package Manager: securely manage deploy open source and proprietary Python and R packages
  • 8. ⌵ ⌵ My History with Posit ● 2016: Hadley Wickham and I make Feather file format ● 2018: I partner with RStudio to create non-profit Ursa Labs for Arrow development ● 2020: Ursa team spins out into startup, becomes Voltron Data in 2021 ● 2022: RStudio becomes Posit, embracing polyglot future ● 2023: I rejoin Posit as a principal architect
  • 9. ⌵ ⌵ https://ursalabs.org/blog/announcing-ursalabs/ The reality is that Hadley and I think the “language wars” are stupid when the real problem we are solving is human user interface design for data analysis. … The programming languages are our medium for crafting accessible and productive tools. It has long been a frustration of mine that it isn’t easier to share code and systems between R and Python. … I found that we share a passion for the long-term vision of empowering data scientists and building a positive relationship with the open source user community. Critically, RStudio has avoided the “startup trap” and managed to build a sustainable business while still investing the vast majority of its engineering resources in open source development.
  • 10. ⌵ ⌵ https://ursalabs.org/blog/announcing-ursalabs/ The reality is that Hadley and I think the “language wars” are stupid when the real problem we are solving is human user interface design for data analysis. … The programming languages are our medium for crafting accessible and productive tools. It has long been a frustration of mine that it isn’t easier to share code and systems between R and Python. … I found that we share a passion for the long-term vision of empowering data scientists and building a positive relationship with the open source user community. Critically, RStudio has avoided the “startup trap” and managed to build a sustainable business while still investing the vast majority of its engineering resources in open source development.
  • 11. ⌵ ⌵ Creative Solutions for Funding OSS Work Sponsors
  • 13. ⌵ ⌵ What is a “Composable Data System”? ● Builds with open standards and protocols ● Designed around modularity, reuse, and interoperability with other systems that use shared interfaces ● Resists “vertical integration”, builds in a virtual cycle with relevant open source ecosystem projects
  • 14. ⌵ ⌵ “We envision that by decomposing data management systems into a more modular stack of reusable components, the development of new engines can be streamlined, while reducing maintenance costs and ultimately providing a more consistent user experience. By clearly outlining APIs and encapsulating responsibilities, data management software could more easily be adapted, for example, to leverage novel devices and accelerators, as the underlying hardware evolves. By relying on a modular stack that reuses execution engine and language frontend, data systems code could provide a more consistent experience and semantics to users, from transactional to analytic systems, from stream processing to machine learning workloads." Pedrera et al. 2023
  • 15. ⌵ ⌵ “We envision that by decomposing data management systems into a more modular stack of reusable components, the development of new engines can be streamlined, while reducing maintenance costs and ultimately providing a more consistent user experience. By clearly outlining APIs and encapsulating responsibilities, data management software could more easily be adapted, for example, to leverage novel devices and accelerators, as the underlying hardware evolves. By relying on a modular stack that reuses execution engine and language frontend, data systems code could provide a more consistent experience and semantics to users, from transactional to analytic systems, from stream processing to machine learning workloads." Pedrera et al. 2023
  • 16. ⌵ ⌵ “We envision that by decomposing data management systems into a more modular stack of reusable components, the development of new engines can be streamlined, while reducing maintenance costs and ultimately providing a more consistent user experience. By clearly outlining APIs and encapsulating responsibilities, data management software could more easily be adapted, for example, to leverage novel devices and accelerators, as the underlying hardware evolves. By relying on a modular stack that reuses execution engine and language frontend, data systems code could provide a more consistent experience and semantics to users, from transactional to analytic systems, from stream processing to machine learning workloads." Pedrera et al. 2023
  • 17. ⌵ ⌵ Why now? ● 1st-Gen Big Data / Hadoop Era: 2006 - 2013 ○ MapReduce popularizes disaggregated storage + compute ● 2nd-Gen 2013 - 2021 ○ Vendors shift from proprietary software to delivery of services ○ Open source standards emerge and are popularized ○ Rapid progress in storage, networking, computing performance ● 3rd-Gen 2021 - … ? ○ Open standards for composability become widely accepted ○ Next-gen components emerge, existing systems start retrofitting with composable pieces
  • 18. ⌵ ⌵ Why now? “...We foresee that composability is soon to cause another major disruption to how data management systems are designed. We foresee that monolithic systems will become obsolete, and give space to a new composable era for data management.” Pedrera et al. 2023
  • 19. ⌵ ⌵ NYC R Conference 2015 https://www.youtube.com/watch?v=stlxbC7uIzM&t=264s
  • 21. ⌵ ⌵ “The published work on big data systems… detail[s] their systems’ impressive scalability, [but] few directly evaluate their absolute performance against reasonable benchmarks. To what degree are these systems truly improving performance, as opposed to parallelizing overheads that they themselves introduce?” - McSherry, Isard, Murray 2015
  • 22. ⌵ ⌵ Glimpse of the Composable Landscape Engines Substrait ADBC, Flight Storage Protocols Query Interface Optimization …?
  • 23. ⌵ ⌵ Missing pieces ● Distributed computing (Spark, Dask, Ray, “Serverless”) ● ETL modeling (dbt and others) ● Workflow / Infra Orchestration (Airflow, Dagster, Flyte, …) ● Development Environments ● Metadata Catalogues
  • 24. ⌵ ⌵ 2016: A Cross-Language Fast In-Memory Data Format
  • 27. ⌵ ⌵ Databases Equipped for the Composable Era ● ADBC : Arrow DataBase Connectivity ● FlightSQL: An Arrow-native wire protocol for SQL systems
  • 30. ⌵ ⌵ Layers of the Composable Cake
  • 31. ⌵ ⌵ Modular Execution Engines ● Apache DataFusion (Rust) ● DuckDB (C++) ● Velox (C++) ● Theseus (C++ / CUDA)
  • 32. ⌵ ⌵ Connecting Execution with Frontend + Optimizer ● Intermediate Representation (IR) for Queries Substrait https://substrait.io/
  • 33. ⌵ ⌵ Modular Acceleration Projects ● Prestissimo (Velox in Presto) ● Apache DataFusion Comet (DataFusion in Spark) ● Apache Gluten (incubating) (Velox in Spark)
  • 34. ⌵ ⌵ A Multi-Engine Data Stack? ● Why be locked into one full-stack execution engine (like Spark)? ● Cost/performance/latency varies greatly across workload shapes and sizes ● You can do a lot with DuckDB, but actual big data does still exist HT to Julien Hurault ! https://juhache.substack.com/p/multi-engine-data-stack-v0
  • 36. ⌵ ⌵ Barriers to a multi-engine data stack ● SQL dialects are non-portable and feature a wide spectrum of supported features ○ sqlglot is helping with this! ● Deciding which engine-to-use is non-trivial ● Diverse orchestration and infrastructure requirements
  • 39. ⌵ ⌵ Enter the Bird ● An effort to harmonize the best of modern SQL with Pythonic fluent data frames ● Fixing a long list of shortcomings in the pandas API ● Leverage the benefits of a modern PL when creating complex analytical SQL queries ● Bringing portability (> 20 backends supported) to provide a unified Python API for a multi-engine data stack ibis-project.org
  • 41. ⌵ ⌵ Other Ibis Notes ● Created in 2015 at the same time as Arrow ● Strong focus on deep SQL support ● Facilitates dev to prod with few code changes ● Integration with many third party libraries (streamlit, altair, others) ● Helps automating the drudgery of tedious multi-step DDL workflows in databases
  • 42. ⌵ ⌵ Why this matters https://voltrondata.com/theseus
  • 43. ⌵ ⌵ A Future Multi-Engine Stack ● An execution engine tailored for different data scales ○ <= 1TB: DuckDB and Friends ○ 1 - 10TB: Spark, Dask, Ray, etc. ○ > 10TB: Hardware-accelerated Processing (e.g. Theseus) ● A portable language front end (Ibis, Malloy, PRQL, or a standard transpilable SQL) ● Arrow-native API and wire transport