Neurodb Engr245 2021 Lessons Learned

NeuroDB
Tony Wang Kun Guo Andrew Freiman Daniel Kharitonov
Picker
Stanford CS Ph.D.
Hacker
Stanford MS&E MS
Hustler
Stanford MBA
Designer
Stanford MS&E Ph.D., MS.CS
Database ML
Todd Basche
Advisor
Maria Popo
Mentor
101 Interviews
NeuroDB
Where we are now…
Cloud-based Pandas dataframe
Where we started...
Unstructured data Tableau-like tool

CTOs and engineering managers looking to cut deep learning model development time and cost
Organic through open-source adoption
B2B sales to monetize our API
Need: Interact with large unstructured datasets (text, images and video) quickly and without supplementary models.
Web dashboard/playground to demo functionalities
Free Python package for developers
Paid API with model hosting
Community Members / Evangelists
ML engineers and data scientists/analysts looking for ways to work faster
Open-source adopters, developers, data scientists, ML engineers
Engineering software / deployment infrastructure
Developer community who build applets and add functionality.
Machine Learning Champions in the Enterprise
Data science and business intelligence teams that need user-friendly tools
Loved and supported by the open-source community.
Preferred by enterprise data analytics/science groups as the go-to solution
Product: Data query engine that executes state-of-the-art models under the hood.
R&D
Community
Evangelism
Model Serving
Problem: Long turnaround time to build deep learning models with unstructured data
Paid API: Faster and more accurate served models
Premium Web Dashboard with paid API key
Business analysts interacting with large unstructured datasets.
The business that we thought we were building...
Introduction Week 1 Week 2 Week 3 Week 4 Week 5 Week 6 Week 7 Week 8 Week 9 The Future

NeuroDB
Our emotional roller coaster ride... (Weeks 1 to 9)
“Tableau” with deep learning
E - Enterprises care more about structured data
C - Interest from a subset of consultants
Open-source virtual data warehouse. Pharma and GIS beachhead
C - Limited market size for consultants
C - Demoing consumer MVP
E - Enterprise unstructured data MVP
Product 1 = Cloud Pandas!
Build out open source offering, community and customer base!
Too many products, no market fit
Intelligent data lake management, 3 ideas to 1
E - Enterprise offering?

NeuroDB
Experiment 1: Demoing NeuroDB with Customers
4
Talked to 15+ Consultants & Analysts
3-4 Expressed Interest…
0 had a need for the tool within the next two weeks
0 followed up about trying the product
Negative Signal
In hindsight, we should have noticed this WAY earlier. We weren’t listening...

NeuroDB
How many consultants
do we need to talk to?
Reflection: What could we have done differently?
4
In hindsight, we should have noticed this WAY earlier. We weren’t listening...
What number of analysts need to express interest?
Some teammates believed
1. The sample size was too small
2. The potentially small market size was offset by the potentially high willingness to pay (WTP)
We should have
1. Defined what product-market fit looked like earlier
2. Defined our goals as a team

NeuroDB
At the same time, we started exploring the idea of an
enterprise product
2
“The biggest problem for us right now is to get insights in real time.”
- Director of data science at a major satellite radio company
Is our deep learning expertise well positioned to solve a real problem facing enterprises?

NeuroDB
Experiment 2: Enterprise MVP
4
Talked to ~10 directors of data science at large enterprises in telecom, airlines, comms
No real interest!
People had very little unstructured data
The unstructured data they do have is very specific and already served by vertical-centric solutions.
Negative Signal

NeuroDB
Experiment 3: Asking about ETL
No clear advantage in a crowded market!
4
Insight: instead of transforming unstructured data, why don’t we transform structured data?

NeuroDB
Any problems with existing tools?
1. Data integrity?
2. Messy data?
3. Have to keep rewriting scripts?
Not really. And if there is a problem, 10 other people
are working on it already.
Also a poor fit for our expertise.
5

NeuroDB
5
We lacked clear direction, felt like we weren’t making progress, and hit a low point...

NeuroDB
In class, we got BURNED...and we deserved it
5

NeuroDB
5
We had some emergency calls
You need to choose a product!
We don’t care which one but pick one...

NeuroDB
5
We had some fierce discussions
Our mentors shared their thoughts…
...Teammates had conflicting opinions...
We created an analytical framework which defused tensions...
Todd Basche
Advisor
Maria Popo
Mentor

NeuroDB
5
We analyzed our options... and MADE A DECISION!

NeuroDB
Focus - Intelligent Data Lake Management
6
Data Lake KPI 1
Churn
CEO’s 5pm musings
Materialize
NeuroDB Virtual Data Warehouse
NeuroDB Data Science IDE

NeuroDB
“I want it”, but …
- “We can’t be your first customer”
- “Do you have an open source software that we can try first?”
- “This is more of a strategic thing, we need to have a high level strategic discussion about data, then we will decide what we will do”
- “PoC process will probably take a few months, have to go through IT security to be cleared etc.”
TL;DR: What we are doing is too ambitious.
6

NeuroDB
We need to start somewhere
● Open source
● Functionally different or dramatically better than competitors
● Small-medium sized enterprises/startups or small autonomous teams in large companies
● Where should we start?
○ Data Catalog?
○ Data Linkage?
○ Data Analytics? (selected)
7

NeuroDB
We found a signal in the area of large scale cloud analytics - we feel competition is limited...
● Back to machine learning / deep learning except we don’t do the machine learning / deep learning!
● We provide the infrastructure to efficiently run deep learning models and other user defined functions at large scale in a cloud native manner through an interactive virtual dataframe.
● Performance targets: > 5x over Spark
● People have given us concrete things they need to see for a PoC!
8
Why dig for AI gold, when we can sell the shovels?

NeuroDB
Here are our customers!
Criteria for pilot: native support for images, see that it works at large scale -- VP of AI Software
Criteria for pilot: runs on Azure, can integrate with Databricks -- VP of Claim Analytics
Want to see the open-source product. -- Chief Architect of the Enterprise Data Office

NeuroDB
We’re starting to build an Open Source MVP!
Spring 2021: Lean Launchpad
Summer 2021: Analysis of Technical Feasibility. Can we be 10x better? How long will it take?
Fall 2021: Incorporate
2021-2022: PhD Open Source Development

VP Data Science / Head of Data Analytics looking to cut cloud cost and dev time
Organic through open-source adoption
B2B sales to monetize our API
Need: Allow users to perform scalable Pandas-like operations natively on cloud
Free Python package for developers
Cloud Service for enterprise solution
Community Members / Evangelists
ML engineers and data scientists/analysts looking for ways to work faster and help open source dev
Open-source adopters, developers, data scientists, ML engineers
Engineering software / deployment infrastructure
Developer community who build applets and add functionality.
Data Science Champions in the Enterprise
Business analysts who can quickly answer questions using NeuroDB and cheer for it
Loved and supported by the open-source community.
Preferred by enterprise data analytics/science groups as the go-to solution
Product: An open-source offering to allow cloud pandas
R&D
Community
Evangelism
Managed Service Cloud Cost
Problem: Pandas is useful but not cloud native
Cloud Service: Storage of data and deep learning models
Cloud Service: Computation of queries, charge by time.
Here is our business model!

We plan to make NeuroDB a reality
We want to hear from you!
tony@neurodb.io

NeuroDB
Position Statement
For data scientists, NeuroDB is an intelligent data lake query tool that empowers end users to harness data sources directly.
Unlike our competitors, we can treat the data lake as a first-class object.

NeuroDB
Value Proposition
Efficiently interact with large volumes (> 1 TB) of images and other unstructured data types in cloud storage with a local Pandas-like dataframe interface without worrying about provisioning clusters or managing compute resources.
- Run deep learning pipelines on millions of images interactively
- Save on data movement costs from S3
- Save time on managing AWS
- Easily deploy batch analytics workflows on streaming data

NeuroDB
Serverless Pandas - High Content Biology
ID | Images | Cell Line | Treatment
1 | (image) | HeLa | Dox
2 | (image) | HeLa | Control
3 | (image) | MCF7 | Dox
4 | (image) | MCF7 | Control
Pandas DataFrame abstraction in Jupyter notebook
Optimization and dispatch
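A minimal sketch of the "DataFrame abstraction, then optimization and dispatch" idea from this slide, written in plain Python rather than against any real NeuroDB API. `VirtualFrame`, `fake_model` and the image names are hypothetical. The frame only records operations; on `collect()` a toy optimizer moves filters ahead of the user defined function so the expensive step touches fewer rows.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass
class VirtualFrame:
    """Hypothetical lazy frame: records operations, runs them on collect()."""
    rows: List[Row]
    plan: List[tuple] = field(default_factory=list)

    def filter(self, predicate: Callable[[Row], bool]) -> "VirtualFrame":
        # Nothing is computed here; the step is only appended to the plan.
        return VirtualFrame(self.rows, self.plan + [("filter", predicate)])

    def assign(self, column: str, udf: Callable[[Row], Any]) -> "VirtualFrame":
        return VirtualFrame(self.rows, self.plan + [("assign", column, udf)])

    def optimized_plan(self) -> List[tuple]:
        # Toy optimization: run filters first so UDFs see fewer rows.
        # Assumes filters never read columns created by assign().
        filters = [s for s in self.plan if s[0] == "filter"]
        assigns = [s for s in self.plan if s[0] == "assign"]
        return filters + assigns

    def collect(self) -> List[Row]:
        # "Dispatch": a real system would ship the plan to cloud workers.
        out = [dict(r) for r in self.rows]
        for step in self.optimized_plan():
            if step[0] == "filter":
                out = [r for r in out if step[1](r)]
            else:
                _, column, udf = step
                for r in out:
                    r[column] = udf(r)
        return out


# Rows mirror the slide's table; image names are placeholders.
plate = VirtualFrame([
    {"id": 1, "image": "img_1", "cell_line": "HeLa", "treatment": "Dox"},
    {"id": 2, "image": "img_2", "cell_line": "HeLa", "treatment": "Control"},
    {"id": 3, "image": "img_3", "cell_line": "MCF7", "treatment": "Dox"},
    {"id": 4, "image": "img_4", "cell_line": "MCF7", "treatment": "Control"},
])

calls: List[int] = []


def fake_model(row: Row) -> int:
    # Stand-in for a deep learning model applied to one image.
    calls.append(row["id"])
    return len(row["image"])


result = (
    plate.assign("score", fake_model)
    .filter(lambda r: r["treatment"] == "Dox")
    .collect()
)
```

Although the filter is written after the UDF, `fake_model` runs only on the two Dox rows, which is the kind of saving a query optimizer behind a dataframe interface could provide.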

NeuroDB
Serverless Pandas - GIS
ID | Coordinate* (virtual schema) | 2005-01-01 … 2020-12-31 | Metadata
1 | Zone #, Corners E/N | ... | Satellite, Sensor, Processing, CRS
2 | | |
3 | | |
4 | | |
*UTM WGS84, NAD83, PSD93, etc.
Pandas DataFrame abstraction in Jupyter notebook
Optimization and dispatch

NeuroDB
Define Workflows with user defined functions
Diagram: a dataframe with columns 1 to 6. UDF1 produces column 5, UDF2 produces an intermediate result, UDF3 produces column 6.
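One way to picture a workflow of user defined functions is as a small dependency graph evaluated in topological order. The sketch below uses Python's standard `graphlib` (3.9+). The wiring (UDF1 from columns 1 and 2, UDF2 from column 5 and column 3, UDF3 from the intermediate and column 4) is only one plausible reading of the slide's diagram, and the column names and arithmetic are made up for illustration.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Tuple

# Each derived column maps to (UDF, input columns). All names are hypothetical.
Workflow = Dict[str, Tuple[Callable[..., Any], List[str]]]


def run_workflow(workflow: Workflow, row: Dict[str, Any]) -> Dict[str, Any]:
    """Evaluate UDFs in dependency order for a single row."""
    # Map every derived column to the columns it depends on.
    graph = {out: set(inputs) for out, (_, inputs) in workflow.items()}
    values = dict(row)
    for column in TopologicalSorter(graph).static_order():
        if column in workflow:  # source columns are already present
            udf, inputs = workflow[column]
            values[column] = udf(*(values[name] for name in inputs))
    return values


workflow: Workflow = {
    "col5": (lambda a, b: a + b, ["col1", "col2"]),          # UDF1
    "intermediate": (lambda c, d: c * d, ["col5", "col3"]),  # UDF2
    "col6": (lambda e, f: e - f, ["intermediate", "col4"]),  # UDF3
}

result = run_workflow(workflow, {"col1": 1, "col2": 2, "col3": 3, "col4": 4})
```

Declaring inputs explicitly is what would let an engine reorder, batch or parallelize independent UDFs instead of running them in the order they were written.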

NeuroDB
Deploy to streaming data
Diagram: UDF1 takes columns 1 and 2 and produces column 5 on incoming rows with columns 1 to 5. The Workflow Registry holds Workflow 1, Workflow 2 and Workflow 3.
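A rough sketch of what a workflow registry could look like, assuming a workflow is just a row-level function stored under a name. `REGISTRY`, `register`, `run_batch` and `run_stream` are hypothetical names, not NeuroDB functions. The point is that the same registered function serves both a batch dataset and a stream of arriving records.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]
Workflow = Callable[[Row], Row]

# Hypothetical registry: workflow name -> row-level function.
REGISTRY: Dict[str, Workflow] = {}


def register(name: str) -> Callable[[Workflow], Workflow]:
    def decorator(fn: Workflow) -> Workflow:
        REGISTRY[name] = fn
        return fn
    return decorator


@register("workflow_1")
def add_col5(row: Row) -> Row:
    # Stand-in for UDF1: derive column 5 from columns 1 and 2.
    return {**row, "col5": row["col1"] + row["col2"]}


def run_batch(name: str, rows: List[Row]) -> List[Row]:
    # Batch mode: the whole dataset is available up front.
    return [REGISTRY[name](r) for r in rows]


def run_stream(name: str, rows: Iterable[Row]) -> Iterator[Row]:
    # Streaming mode: same registered function, applied as records arrive.
    workflow = REGISTRY[name]
    for r in rows:
        yield workflow(r)


batch = [{"col1": 1, "col2": 2}, {"col1": 3, "col2": 4}]
batch_out = run_batch("workflow_1", batch)
stream_out = list(run_stream("workflow_1", iter(batch)))
```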

NeuroDB
Key Optimizations: Batch
NeuroDB, under the hood,
- Uses serverless computing to allow near-infinite scaling with fixed compute cost; executes deep learning pipelines on millions of rows in real time.
- Optimizes batch size and the workflow DAG, and lowers cost by selecting the best hardware (CPU/GPU, core count, instance type, etc.) for the job.
- Caches data transparently and automatically to speed up future pipelines.
- Deep learning: accelerates models with quantization or pruning options to take advantage of the latest hardware advancements.

NeuroDB
Key Optimizations: Streaming
- Automatic hardware configuration to meet given throughput and latency target with minimum cost.
- Easily deploy workflows developed on batch datasets on streaming data without changing a single line of code.
- Convert the workflow into a UI for edge deployment, e.g. in lab. Run the workflow on new data just by uploading inputs through a web UI.
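Automatic hardware configuration can be thought of as a small search: drop options that miss the latency target, size each remaining option to the throughput target, and keep the cheapest. The sketch below assumes per-instance throughput and latency are already known. `HardwareOption`, `cheapest_config` and all numbers are illustrative, not real instance types or prices.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class HardwareOption:
    name: str
    rows_per_second: float   # estimated throughput per instance
    latency_ms: float        # per-row latency
    dollars_per_hour: float  # price per instance (made-up values below)


def cheapest_config(
    options: List[HardwareOption],
    target_rows_per_second: float,
    max_latency_ms: float,
) -> Optional[Tuple[HardwareOption, int, float]]:
    """Return (option, instance_count, hourly_cost) meeting both targets, or None."""
    best = None
    for opt in options:
        if opt.latency_ms > max_latency_ms:
            continue  # adding instances does not reduce per-row latency
        count = math.ceil(target_rows_per_second / opt.rows_per_second)
        cost = count * opt.dollars_per_hour
        if best is None or cost < best[2]:
            best = (opt, count, cost)
    return best


options = [
    HardwareOption("cpu", rows_per_second=100, latency_ms=50, dollars_per_hour=0.10),
    HardwareOption("gpu", rows_per_second=2000, latency_ms=20, dollars_per_hour=0.90),
    HardwareOption("slow", rows_per_second=500, latency_ms=200, dollars_per_hour=0.20),
]

heavy = cheapest_config(options, target_rows_per_second=1000, max_latency_ms=100)
light = cheapest_config(options, target_rows_per_second=50, max_latency_ms=100)
```

With these numbers, the heavy load favors a single GPU instance over ten CPU instances, while the light load favors one CPU instance. The "slow" option is never chosen because it misses the latency target regardless of count.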
