© 2017 Continuum Analytics - Confidential & Proprietary
© 2018 Quansight - Confidential & Proprietary
Thriving in a Data Driven World
travis@quansight.com
@quansightai
@teoliphant
Converge 2019
https://www.quansight.com
My Python Data and ML/AI Time-Line
[Timeline figure: year labels 1991, 1998, 2001, 2003, 2005, 2006, 2008, 2009, 2010, 2012, 2014, 2015, 2016, 2018, …]
Starting companies to sustain OSS
renamed
~18 million Anaconda users
Peter Wang
Building new solutions
Replaced by
Spin Out
Spin Out
2012
2018
?
Key members of the management team at Continuum created
Quansight. In a real sense NumFOCUS and Anaconda are our first
(spin-out) organizations.
2015
Build and Connect
Companies and
Communities to
Solve Challenging
Problems with Data
Continuing my quest to find more
ways to pay developers to work on
open source!
Core Business
Quansight Labs Support
Staffing / Mentoring / Training
Custom Data/Viz/ML Consulting
Open Source Consulting
An early stage venture
capital firm investing in
startups that build on
open-source technology
and support the
communities they
depend on.
Bradden Blair
supporting
FairOSS
LABS
Sustaining the Future
Open-source innovation and
maintenance around the entire data-
science and AI workflow.
• NumPy ecosystem maintenance (PyData Core Team)
• Improve connection of NumPy to ML Frameworks
• GPU Support for NumPy Ecosystem
• Improve foundations of Array computing
• JupyterLab and JupyterHub
• Data Catalog standards
• Packaging (conda-forge, PyPA, etc.)
uarray — unified array interface for SciPy refactor
xnd — re-factored NumPy (low-level cross-language
libraries for N-D (tensor) computing)
Collaborating with
NumFOCUS and Ursa Labs
(supporting Apache Arrow)
Bokeh
Adapted from Jake Vanderplas
PyCon 2017 Keynote
My little “side projects” became my life
Where I started
Started as my graduate student
“procrastination project” (as Multipack)
in 1998 and became SciPy in 2001 with
the help of Eric Jones, Pearu Peterson,
and others.
108 releases, 766 contributors
Used by: 128,495
SciPy
“Distribution of Python Numerical Tools masquerading as one Library”
Name         Description
cluster      KMeans and Vector Quantization
fftpack      Discrete Fourier Transform
integrate    Numerical Integration
interpolate  Interpolation routines
io           Data Input and Output
linalg       Fast Linear algebra
misc         Utilities
ndimage      N-dimensional Image processing
odr          Orthogonal Distance Regression
optimize     Constrained and Unconstrained Optimization
signal       Signal Processing Tools
sparse       Sparse Matrices and Algebra
spatial      Spatial Data Structures and Algorithms
special      Special functions (e.g. Bessel)
stats        Statistical Functions and Distributions
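To make the "distribution of tools" idea concrete, here is a minimal sketch that touches three of the subpackages listed above. The integrand, the objective function, and the variable names are chosen only for illustration.

```python
from scipy import integrate, optimize, special

# integrate: numerically integrate x**2 over [0, 1] (exact answer is 1/3)
area, abs_error = integrate.quad(lambda x: x**2, 0.0, 1.0)

# optimize: minimize a one-dimensional quadratic whose minimum is at x = 2
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2 + 1.0)

# special: Bessel function of the first kind, order 0, evaluated at 0
j0_at_zero = special.j0(0.0)

print(area, result.x, j0_at_zero)
```

Each subpackage is imported and used independently, which is what the "masquerading as one Library" line refers to.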
Where it led for me
159 releases, 827 contributors
Used by: 254,856
Standard Array/Tensor Library driving Python
to be the de facto language for Data Science and ML
Brief History of NumPy
Person                                     Package        Year
Jim Fulton                                 Matrix Object  1994
Jim Hugunin                                Numeric        1995
Perry Greenfield, Rick White, Todd Miller  Numarray       2001
Travis Oliphant                            NumPy          2005
NumPy was created to unify array objects in
Python and unify the early PyData community
Numeric
Numarray
NumPy
I essentially sacrificed tenure at a University to write NumPy and
unify array objects.
Python’s Scientific Ecosystem
Bokeh
Jake Vanderplas PyCon 2017 Keynote
Huge Impact (from diverse efforts of 1000s)
LIGO: Gravitational Waves
Higgs Boson
Discovery
Black Hole
Imaging
[Figure: Google Search Trends for Java, JavaScript, and Python, Jun 2019]
Thriving in a Data-Driven
World starts with building on
the Open Source Software
that forms the foundation of
Data Science and Machine
Learning today.
Open Source
Ecosystem
Your Product/
Project
With Quansight, you can actually
“influence the direction of the wind”
LABS
Open-source powered development
Community Work Orders let you influence OSS
Cooperative Platform for Community Work Orders
An effective case study in connecting with open source communities
(harnessing and influencing the open-source wind)
Quansight and OmniSci funded 10+ open source developers for
1¾ years to connect OmniSci with the PyData community
• JupyterLab Extensions
• Ibis SQL Framework (OmniSci Backend, geospatial functions)
• Altair & VegaLite Visualization (Modernized visualization
specifications)
• Conda packages
• User Defined Table Functions with Numba
OmniSci Immerse & JupyterLab working interchangeably. Python
Data Scientists and OmniSci users can work in a unified
development environment.
JupyterLab Extension
OmniSci Engine can be
Connected directly to
JupyterLab components
The data can be used by
the entire PyData
ecosystem
All the open-source
deliciousness can be re-
used
As the community tools
get better, OmniSci users
benefit automatically!
Compile (Numerical) Python
to Native code for CPU and GPU
an open source JIT compiler that
translates a subset of Python and
NumPy code into fast machine code.
http://numba.pydata.org
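A minimal sketch of what "compile numerical Python to native code" looks like in practice, using Numba's `njit` decorator. The function `sum_of_squares` is a made-up example, and the `try`/`except` fallback is an assumption added here so the snippet still runs (uncompiled) where Numba is not installed.

```python
try:
    from numba import njit  # Numba's "no-Python mode" JIT decorator
except ImportError:
    def njit(func):
        """Fallback (illustration only): run the plain Python function."""
        return func


@njit
def sum_of_squares(n):
    # A tight numerical loop: the kind of code Numba compiles well
    total = 0.0
    for i in range(n):
        total += i * i
    return total


print(sum_of_squares(1000))
```

With Numba present, the first call triggers compilation for the argument types used, and later calls reuse the compiled machine code.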
OmniSci Table
User-defined
Table Functions
Result Set
rbc (Remote-Backend Compiler)
SQL Engine
https://github.com/xnd-project/rbc
XND
https://xnd.io
Problem
Open Source Teams
• Burned out
• Underrepresented
• Underpaid
Organizations
• Disconnected from
the Community
• Lack support and
maintenance
There’s no easy way to connect the
community with organizations
Solution
A marketplace where companies can cooperatively fund progress and
maintenance for projects and technology that affects them
Organizations
Save money &
Reduce risk
Teams
Improve
project health
Copyright OpenTeams 2019. All rights reserved.
Projects develop their roadmaps
Product
Organizations find and fund projects they depend on
Product
Companies hire from the community
Product
Initiatives are Progress or
Maintenance with an
accountable organization
committed to finalizing and
following-up to do the work
using open-source devs.
The platform enables easy
signaling and cooperation
between many potential
funders and organized open-
source groups.
Alpha Feature!
Several Deep Learning Libraries to Choose From
Built on NumPy/SciPy
Recommended
Recommended
Key Features Needed for any ML Library
For Training
• Ability to create chains of functions on n-dimensional arrays
• Ability to compute the derivative of the loss function quickly (Automatic
Differentiation)
• Key Loss Functions implemented
• Cross-validation methods
• An Optimization library with several useful methods
• Ability to compute functions on n-dimensional arrays on multiple
hardware targets with highly parallel execution
For Inference
• Ability to create chains of functions on n-dimensional arrays
• Ability to compute functions on n-dimensional arrays on multiple hardware targets
Missing from NumPy / SciPy and Scikit-Learn
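To illustrate the automatic differentiation item above, here is a minimal pure-Python sketch of forward-mode differentiation with dual numbers, followed by a few gradient-descent steps. It is not how any particular framework implements it. The `Dual` class, the one-parameter model `y = w * x`, the data, and the step size are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Dual:
    """A value paired with its derivative with respect to one input."""
    val: float
    dot: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __sub__(self, other):
        other = _lift(other)
        return Dual(self.val - other.val, self.dot - other.dot)

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants (derivative zero)."""
    return x if isinstance(x, Dual) else Dual(float(x))


def squared_error_loss(w, xs, ys):
    """A chain of array-style operations: sum of (w * x - y)**2."""
    total = Dual(0.0)
    for x, y in zip(xs, ys):
        residual = w * x - y
        total = total + residual * residual
    return total


def grad(loss, w, *args):
    """Derivative of loss with respect to w, obtained by seeding dot = 1."""
    return loss(Dual(w, 1.0), *args).dot


xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # data generated by y = 2x
initial_gradient = grad(squared_error_loss, 0.0, xs, ys)

w = 0.0
for _ in range(50):                          # plain gradient descent
    w -= 0.05 * grad(squared_error_loss, w, xs, ys)

print(initial_gradient, w)
```

The loss function is written as ordinary arithmetic, and the derivative falls out of operator overloading. Frameworks add reverse mode, n-dimensional arrays, and hardware backends on top of the same idea.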
Most libraries (other than Chainer) chose
to re-implement NumPy and SciPy as they
needed.
Reasons:
• Started with legacy code in another language
• Had to work with other languages too (Node, Java, C++, Lua, etc.)
• Needed only a subset of the functionality of NumPy / SciPy to build ML
• Needed GPU support
• Lacked familiarity with the NumPy / SciPy communities and how to engage
with them
Result: Many competing similar choices for Deep Learning
https://github.com/josephmisiti/awesome-machine-learning#python-general-purpose
http://deeplearning.net/software_links/
http://scikit-learn.org/stable/related_projects.html
Explosion of ML Frameworks and libraries
TVM/NNVM
Now array-like objects everywhere
Sparse Arrays
Neon
CUDArray
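One way to see what "array-like" means in practice: NumPy's `__array_function__` protocol (NEP 18, not named on these slides) lets a non-NumPy object intercept calls to NumPy functions. The sketch below is an illustration under that assumption. `UnitArray` is a hypothetical class, not part of any library mentioned here.

```python
import numpy as np


class UnitArray:
    """Hypothetical array-like that carries a unit label with its data."""

    def __init__(self, data, unit):
        self.data = np.asarray(data)
        self.unit = unit

    def __array_function__(self, func, types, args, kwargs):
        # Unwrap UnitArray arguments, run the NumPy function on the raw
        # data, then re-wrap the result so the unit label is preserved.
        unwrapped = [a.data if isinstance(a, UnitArray) else a for a in args]
        return UnitArray(func(*unwrapped, **kwargs), self.unit)


x = UnitArray([1.0, 2.0, 3.0], "m")
total = np.sum(x)            # dispatched to UnitArray.__array_function__

print(type(total).__name__, float(total.data), total.unit)
```

Every library that ships its own array object has to make a choice like this about how much of the NumPy API to mirror, which is where the fragmentation comes from.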
NumPy was created to unify array objects in
Python and unify the early PyData community
Numeric
Numarray
NumPy
I essentially sacrificed tenure at a University to write NumPy and
unify array objects.
We have a “divided” community again!
Numeric
Numarray
NumPy
Python’s Scientific Ecosystem
Bokeh
Jake Vanderplas PyCon 2017 Keynote
Examples of packages being built on
fragmented APIs
FastAI
skorch
Pyro
Eduard
anyrl
Braid
PyMC4
Horovod
MLFlow
But note
Real Problem — Funding for Community Devs
Full-time: 2 Full-time: .5
Full-time: 1
Open Source is too important to be just left to volunteer time — current situation is not working to
sustain millions of users:
• No funding for creators of these libraries to continue their work
• GPU support could have been added to NumPy years ago
• SciPy took 17 years to hit 1.0
• NumPy should already be at 2.0 — but not without full-time guidance
Full-time: 2
Full-time: 0
Solution
A marketplace where companies can cooperatively fund progress and
maintenance for projects and technology that affects them
Organizations
Save money &
Reduce risk
Teams
Improve
project health
Initiatives are Progress or
Maintenance with an
accountable organization
committed to finalizing and
following-up to do the work
using open-source devs.
The platform enables easy
signaling and cooperation
between many potential
funders and organized open-
source groups.
Community proposal — gathering support
High Level APIs for Arrays (Tensors),
DataFrames, and DataTypes
LABS
OpenTensors
• Community-driven and governed with many companies and
contributors (project managed by Quansight Labs)
• Addition of standardized automatic differentiation, graph-construction
(lazy mode), addition of GPUs, and sparse arrays
• Use for Deep Learning but also all the other uses of the PyData/NumFOCUS
ecosystem
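As a rough picture of what "graph-construction (lazy mode)" means, the sketch below builds an expression graph first and computes it only when asked. This is a toy under stated assumptions, not the proposed OpenTensors design. `Node`, `placeholder`, and `evaluate` are hypothetical names.

```python
class Node:
    """One step in a deferred computation graph."""

    def __init__(self, op, inputs=(), payload=None):
        self.op = op            # "input", "const", "add", or "mul"
        self.inputs = inputs    # upstream Node objects
        self.payload = payload  # input name or constant value

    def __add__(self, other):
        return Node("add", (self, as_node(other)))

    def __mul__(self, other):
        return Node("mul", (self, as_node(other)))


def as_node(value):
    """Wrap plain Python values as constant nodes."""
    return value if isinstance(value, Node) else Node("const", payload=value)


def placeholder(name):
    """Declare a named input whose value is supplied at evaluation time."""
    return Node("input", payload=name)


def evaluate(node, feeds):
    """Walk the graph recursively and compute the result."""
    if node.op == "input":
        return feeds[node.payload]
    if node.op == "const":
        return node.payload
    left, right = (evaluate(child, feeds) for child in node.inputs)
    if node.op == "add":
        return left + right
    if node.op == "mul":
        return left * right
    raise ValueError(f"unknown op: {node.op}")


x = placeholder("x")
y = placeholder("y")
expr = x * 3 + y                 # builds a graph; nothing is computed yet

value = evaluate(expr, {"x": 2, "y": 1})
print(expr.op, value)
```

Because the graph exists before any numbers flow through it, a real system can differentiate it, optimize it, or send it to a GPU before execution.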
Provide a community-sponsored and backed future!
Join Us!
• Solidifying commitments for at least $6 million over 3 years
($2 million / year) (need <33% from any one company).
• Register support for the initiative on openteams.com
• Email rgommers@quansight.com, travis@quansight.com or
matt@quansight.com
• Tweet to @quansightai or @openteamsinc
• Get in touch to ensure your needs are included in the initial
deliverables
How to Thrive In a Data-Driven World?
Open Source Contributors of the Projects you depend on!

Keynote at Converge 2019
