SlideShare a Scribd company logo
1 of 61
Download to read offline
Full Stack Deep Learning
Josh Tobin, Sergey Karayev, Pieter Abbeel
Data Management
Full Stack Deep Learning
“All-in-one”
Deployment
or or
Development Training/Evaluation
or
Data
or
Experiment Management
Labeling
Interchange
Storage
Monitoring
?
Hardware /
Mobile
Software Engineering
Frameworks
Database
Hyperparameter TuningDistributed Training
Web
Versioning
Processing
CI / Testing
?
Resource Management
Data Management - overview
Full Stack Deep Learning 3
https://veekaybee.github.io/2019/02/13/data-science-is-different/
Data Management - overview
Full Stack Deep Learning
In this module
4
1. Sources

Where training data comes from

2. Labeling

How to label proprietary data at scale

3. Storage

Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately

4. Versioning

Dataset is updated through user activity or additional labeling, affects model

5. Processing

Aggregating and converting raw data and metadata
Data Management - overview
Full Stack Deep Learning
In this module
5
1. Sources

Where training data comes from

2. Labeling

How to label proprietary data at scale

3. Storage

Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately

4. Versioning

Dataset is updated through user activity or additional labeling, affects model

5. Processing

Aggregating and converting raw data and metadata
Data Management - overview
Full Stack Deep Learning
• Most DL applications require lots of labeled data

• Exceptions: RL, GANs, "semi-supervised" learning

• Publicly available datasets = No competitive advantage

• But can serve as starting point
6
Sources
Data Management - sources
Full Stack Deep Learning
Usually: spend $$$ and time to label own data
7
https://cdn-sv1.deepsense.ai/wp-content/uploads/2017/04/sample_image_from_the_training_set.jpg
Data Management - sources
Full Stack Deep Learning 8Data Management - sources
Data flywheel
Enables rapid improvement with user labels
Full Stack Deep Learning
9
Data Management - sources
https://lilianweng.github.io/lil-log/2019/11/10/self-supervised-learning.html
Semi-supervised learning
Fig. 4. Illustration of self-supervised learning by predicting the relative position of two random patches. (Image source:
Doersch et al., 2015)
Fig. 1. A great summary of how self-supervised learning tasks can be constructed (Image source: LeCun’s talk)
Use parts of data to label other parts
Full Stack Deep Learning
Image data augmentation
• Must do for training vision models

• Keras, fast.ai provide functions that do this

• Hook into your Dataset class to do in CPU
workers in parallel to GPU training
10
https://towardsdatascience.com/1000x-faster-data-augmentation-b91bafee896c
Data Management - sources
Full Stack Deep Learning
Other data augmentation
• Tabular

• Delete some cells to simulate missing
data

• Text

• No well established techniques, but
replace words with synonyms, change
order of things.

• Speech/video

• Change speed, inserts pauses, etc
11
https://github.com/makcedward/nlpaug
Data Management - sources
Full Stack Deep Learning 12
https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning/
Data Management - sources
Synthetic data
Underrated idea that is almost
always worth starting with
Full Stack Deep Learning 13
This can get pretty deep!
Andrew Moffat - https://github.com/amoffat/metabrite-receipt-tests
Data Management - sources
Full Stack Deep Learning 14
https://microsoft.github.io/AirSim/
Data Management - sources
https://openai.com/blog/ingredients-for-robotics-research/
Especially for driving and robotics
Full Stack Deep Learning
Questions?
15
Full Stack Deep Learning
In this module
16
1. Sources

Where training data comes from

2. Labeling

How to label proprietary data at scale

3. Storage

Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately

4. Versioning

Dataset is updated through user activity or additional labeling, affects model

5. Processing

Aggregating and converting raw data and metadata
Data Management - overview
Full Stack Deep Learning
Data Labeling
1. User Interfaces

2. Sources of labor

3. Service companies
17Data Management - labeling
Full Stack Deep Learning 18
Standard set of features:

- bounding boxes,
segmentations,
keypoints, cuboids

- set of applicable
classes
Data Management - labeling
Full Stack Deep Learning 19
Training the annotators is crucial
Quality assurance is key
Data Management - labeling
Full Stack Deep Learning 20
• Hire own annotators, promote best ones to quality control

• Pros: secure, fast (once hired), less QC needed

• Cons: expensive, slow to scale, admin overhead

• ...or, crowdsource (Mechanical Turk)
• Pros: cheaper, more scalable

• Cons: not secure, significant QC effort required

• ...or, full-service data labeling companies
Sources of Labor
Data Management - labeling
Full Stack Deep Learning
Service Companies
21
• Data labeling requires separate software stack, temporary labor, and
quality assurance. Makes sense to outsource.
• Dedicate several days to selecting the best one for you:

• Label gold standard data yourself

• Sales calls with several contenders, ask for work sample on same data

• Ensure agreement with your gold standard, and evaluate on value
Data Management - labeling
Full Stack Deep Learning 22
FigureEight is the original AI data labeling company
Data Management - labeling
Full Stack Deep Learning 23
Scale.ai is a dominant up-and-comer
Data Management - labeling
Full Stack Deep Learning 24
And there are a ton of others
Data Management - labeling
Full Stack Deep Learning 25
And there are a ton of others
Data Management - labeling
Full Stack Deep Learning
Software
26
• Full-service data labeling is always pricy

• But some companies offer their software without labor
Data Management - labeling
Full Stack Deep Learning 27
Prodigy
Data Management - labeling
Full Stack Deep Learning 28
Prodigy
Data Management - labeling
Full Stack Deep Learning 29
• Conclusions

• outsource to full-service company if you can afford it

• if not, then at least use existing software

• hiring part-time makes more sense than trying to make crowdsourcing
work
Data Management - labeling
Full Stack Deep Learning
Questions?
30
Full Stack Deep Learning
In this module
31
1. Sources

Where training data comes from

2. Labeling

How to label proprietary data at scale

3. Storage
Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately

4. Versioning

Dataset is updated through user activity or additional labeling, affects model

5. Processing

Aggregating and converting raw data and metadata
Data Management - overview
Full Stack Deep Learning
Data Storage
1. Building blocks

- Filesystem

- Object Storage

- Database

- "Data Lake"

2. What goes where

3. Where to learn more
32Data Management - storage
Full Stack Deep Learning
Filesystem
• Foundational layer of storage.

• Fundamental unit is a "file", which can be text or binary, is not versioned,
and is easily overwritten.

• Can be as simple as a locally mounted disk containing all the files you need.

• Can be networked (e.g. NFS): accessible over network by multiple machines.

• Can be distributed (e.g. HDFS): stored and accessed over multiple
machines.

• Be mindful of access patterns! Can be fast, but not parallel.
33Data Management - storage
Full Stack Deep Learning
Object Storage
• An API over the filesystem. GET, PUT, DELETE files to a service, without
worrying where they are actually stored.

• Fundamental unit is an "object". Usually binary: image, sound file, etc.

• Versioning, redundancy can be built into the service.

• Can be parallel, but not fast.

• Amazon S3 is the canonical example.

• For on-prem setup, Ceph is a good solution.
34Data Management - storage
Full Stack Deep Learning
Database
• Persistent, fast, scalable storage and retrieval of structured data.

• Mental model: everything is actually in RAM, but software ensures that
everything is logged to disk and never lost.

• Fundamental unit is a "row": unique id, references to other rows, values in
columns.

• Not for binary data! Store references instead.

• Usually for data that will be accessed again (not logs).

• Postgres is the right choice >90% of the time. Best-in-class SQL, great support
for unstructured JSON, actively developed.
35Data Management - storage
Full Stack Deep Learning
SQL
• SQL is the right interface for
structured data.

• If you avoid using it, you will
eventually reinvent a slow
and crappy subset of it.
36Data Management - storage
Full Stack Deep Learning
"Data Lake"
• Unstructured aggregation of data from multiple sources, e.g. databases,
logs, expensive data transformations.

• "Schema on read": dump everything in, then transform for specific
needs later.

37
https://medium.com/data-ops/throw-your-data-in-a-lake-32cd21b6de02
Data Management - storage
Full Stack Deep Learning
What goes where
• Binary data (images, sound files, compressed texts) is stored as
objects.

• Metadata (labels, user activity) is stored in database.

• If need features which are not obtainable from database (logs), set up
data lake and a process to aggregate needed data.

• At training time, copy the data that is needed onto filesystem (local or
networked).
38Data Management - storage
Full Stack Deep Learning
Where to learn more
39
https://dataintensive.net
Data Management - storage
Full Stack Deep Learning
Questions?
40
Full Stack Deep Learning
In this module
41
1. Sources

Where training data comes from

2. Labeling

How to label proprietary data at scale

3. Storage

Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately

4. Versioning
Dataset is updated through user activity or additional labeling, affects model

5. Processing

Aggregating and converting raw data and metadata
Data Management - overview
Full Stack Deep Learning
Data Versioning
Level 0: unversioned

Level 1: versioned via snapshot at training time

Level 2: versioned as a mix of assets and code

Level 3: specialized data versioning solution
42Data Management - versioning
Full Stack Deep Learning
Level 0
• Data lives are on filesystem/S3 and database

• Problem: Deployments must be versioned. Deployed machine learning
models are part code, part data. If data is not versioned, deployed
models are not versioned.

• Problem you will face: inability to get back to a previous level of
performance
43Data Management - versioning
Full Stack Deep Learning
Level 1
• Data is versioned by storing a snapshot of everything at training time

• This allows you to version deployed models, and to get back to past
performance, but is super hacky.

• Would be far better to be able to version data just as easily as code.
44Data Management - versioning
Full Stack Deep Learning
Level 2
• Data is versioned as a mix of assets and code.

• Heavy files stored in S3, with unique ids. Training data is stored as JSON or
similar, referring to these ids and include relevant metadata (labels, user activity,
etc).

• JSON files can get big, but using git-lfs lets us store them just as easily as code

• Can improve further with "lazydata": only syncing files that are needed.

• The git signature + of the raw data file defines the version of the dataset

• Often helpful to add timestamp
45Data Management - versioning
Full Stack Deep Learning
Level 3
• Specialized solutions for versioning data.

• Avoid these until you can fully explain how they will improve your project.

• Leading solutions are DVC, Pachyderm, Quill.
46Data Management - versioning
Full Stack Deep Learning
DVC
47
1
2 3
4
Data Management - versioning
Full Stack Deep Learning
Pachyderm
48Data Management - versioning
Full Stack Deep Learning
Dolt
49
A nice simple solution for
versioning databases,
that speaks SQL.
Data Management - versioning
Full Stack Deep Learning
Dolt
50Data Management - versioning
Full Stack Deep Learning
Questions?
51
Full Stack Deep Learning
In this module
52
1. Sources

Where training data comes from

2. Labeling

How to label proprietary data at scale

3. Storage

Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately

4. Versioning

Dataset is updated through user activity or additional labeling, affects model

5. Processing
Aggregating and converting raw data and metadata
Data Management - overview
Full Stack Deep Learning
Motivational Example
•We have to train a photo popularity predictor every night.

• For each photo, training data must include:

• Metadata such as posting time, title, location

• Some features of the user, such as how many times they
logged in today.

• Outputs of photo classifiers (content, style)
53
In database
Need to compute
from logs
Need to run classifiers
Data Management - processing
Full Stack Deep Learning 54
Task Dependencies
• Some tasks can't be
started until other tasks
are finished.

• Finishing a task should
"kick off" its dependencies
Data Management - processing
Full Stack Deep Learning
• Run everything from scratch. If things exist, no need to recompute.
55
https://www.slideshare.net/PyData/how-i-learned-to-time-travel-or-data-pipelining-and-scheduling-with-airflow-67650418
Data Workflows
Data Management - processing
Full Stack Deep Learning
Makefile limitations
• What if re-computation needs to depend on content, not on date?

• What if the dependencies are not files, but disparate programs and
databases?

• What if the work needs to be spread over multiple machines?

• What if there are 100s of "Makefiles" all executing at the same time, with
shared dependencies?
56Data Management - processing
Full Stack Deep Learning
Airflow
57
https://www.slideshare.net/PyData/how-i-learned-to-time-travel-or-data-pipelining-and-scheduling-with-airflow-67650418
Data Management - processing
Full Stack Deep Learning
Distributing work
• The workflow manager has a queue for the tasks, and manages workers
that pull from it, restarting jobs if they fail.
58
http://site.clairvoyantsoft.com/making-apache-airflow-highly-available/
Data Management - processing
Full Stack Deep Learning
Try to keep things simple
• Don't overengineer

• e.g. unix: powerful parallelism,
streaming, highly optimized
59
https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
26 minutes
70 seconds
18 seconds
in parallel
in parallel
Data Management - processing
Full Stack Deep Learning
Questions?
60
Full Stack Deep Learning
Thank you!
61

More Related Content

What's hot

Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQueryIntro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQueryChris Schalk
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMatei Zaharia
 
Approximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetupApproximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetupErik Bernhardsson
 
What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?RTTS
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingDatabricks
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
 
Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?DATAVERSITY
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022Jim Dowling
 
Three Pillars, Zero Answers: Rethinking Observability
Three Pillars, Zero Answers: Rethinking ObservabilityThree Pillars, Zero Answers: Rethinking Observability
Three Pillars, Zero Answers: Rethinking ObservabilityDevOps.com
 
Introducing Neo4j
Introducing Neo4jIntroducing Neo4j
Introducing Neo4jNeo4j
 
Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderDatabricks
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on ReadDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Using the Chebotko Method to Design Sound and Scalable Data Models for Apache...
Using the Chebotko Method to Design Sound and Scalable Data Models for Apache...Using the Chebotko Method to Design Sound and Scalable Data Models for Apache...
Using the Chebotko Method to Design Sound and Scalable Data Models for Apache...Artem Chebotko
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 

What's hot (20)

Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQueryIntro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Approximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetupApproximate nearest neighbor methods and vector models – NYC ML meetup
Approximate nearest neighbor methods and vector models – NYC ML meetup
 
What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?What is a Data Warehouse and How Do I Test It?
What is a Data Warehouse and How Do I Test It?
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks Streaming
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 
HDFS Analysis for Small Files
HDFS Analysis for Small FilesHDFS Analysis for Small Files
HDFS Analysis for Small Files
 
Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
 
Three Pillars, Zero Answers: Rethinking Observability
Three Pillars, Zero Answers: Rethinking ObservabilityThree Pillars, Zero Answers: Rethinking Observability
Three Pillars, Zero Answers: Rethinking Observability
 
Introducing Neo4j
Introducing Neo4jIntroducing Neo4j
Introducing Neo4j
 
Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks Autoloader
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on Read
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Using the Chebotko Method to Design Sound and Scalable Data Models for Apache...
Using the Chebotko Method to Design Sound and Scalable Data Models for Apache...Using the Chebotko Method to Design Sound and Scalable Data Models for Apache...
Using the Chebotko Method to Design Sound and Scalable Data Models for Apache...
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
Lambda architecture
Lambda architectureLambda architecture
Lambda architecture
 

Similar to Data Management Overview

ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...DATAVERSITY
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
Understanding Big Data And Hadoop
Understanding Big Data And HadoopUnderstanding Big Data And Hadoop
Understanding Big Data And HadoopEdureka!
 
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...DATAVERSITY
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera, Inc.
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game ChangerCaserta
 
Geek Sync | Deployment and Management of Complex Azure Environments
Geek Sync | Deployment and Management of Complex Azure EnvironmentsGeek Sync | Deployment and Management of Complex Azure Environments
Geek Sync | Deployment and Management of Complex Azure EnvironmentsIDERA Software
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMichael Hiskey
 
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data TorrentSeagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data TorrentSeeling Cheung
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurgeRTTS
 
ADV Slides: Comparing the Enterprise Analytic Solutions
ADV Slides: Comparing the Enterprise Analytic SolutionsADV Slides: Comparing the Enterprise Analytic Solutions
ADV Slides: Comparing the Enterprise Analytic SolutionsDATAVERSITY
 
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesDATAVERSITY
 
data analytics lecture 3.2.ppt
data analytics lecture 3.2.pptdata analytics lecture 3.2.ppt
data analytics lecture 3.2.pptRutujaPatil247341
 
Qiagram
QiagramQiagram
Qiagramjwppz
 

Similar to Data Management Overview (20)

ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Understanding Big Data And Hadoop
Understanding Big Data And HadoopUnderstanding Big Data And Hadoop
Understanding Big Data And Hadoop
 
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer
 
Geek Sync | Deployment and Management of Complex Azure Environments
Geek Sync | Deployment and Management of Complex Azure EnvironmentsGeek Sync | Deployment and Management of Complex Azure Environments
Geek Sync | Deployment and Management of Complex Azure Environments
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
GDSC Cloud Jam.pptx
GDSC Cloud Jam.pptxGDSC Cloud Jam.pptx
GDSC Cloud Jam.pptx
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data TorrentSeagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurge
 
ADV Slides: Comparing the Enterprise Analytic Solutions
ADV Slides: Comparing the Enterprise Analytic SolutionsADV Slides: Comparing the Enterprise Analytic Solutions
ADV Slides: Comparing the Enterprise Analytic Solutions
 
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
 
unit 1 big data.pptx
unit 1 big data.pptxunit 1 big data.pptx
unit 1 big data.pptx
 
data analytics lecture 3.2.ppt
data analytics lecture 3.2.pptdata analytics lecture 3.2.ppt
data analytics lecture 3.2.ppt
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Qiagram
QiagramQiagram
Qiagram
 

More from Sergey Karayev

Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...Sergey Karayev
 
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Sergey Karayev
 
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Machine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep LearningMachine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep LearningSergey Karayev
 
Setting up Machine Learning Projects - Full Stack Deep Learning
Setting up Machine Learning Projects - Full Stack Deep LearningSetting up Machine Learning Projects - Full Stack Deep Learning
Setting up Machine Learning Projects - Full Stack Deep LearningSergey Karayev
 
Research Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep LearningResearch Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep LearningSergey Karayev
 
Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningSergey Karayev
 
AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019Sergey Karayev
 
Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.Sergey Karayev
 

More from Sergey Karayev (19)

Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
 
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
 
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
 
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
 
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
 
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
 
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
 
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
 
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
 
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
 
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
 
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
 
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
 
Machine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep LearningMachine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep Learning
 
Setting up Machine Learning Projects - Full Stack Deep Learning
Setting up Machine Learning Projects - Full Stack Deep LearningSetting up Machine Learning Projects - Full Stack Deep Learning
Setting up Machine Learning Projects - Full Stack Deep Learning
 
Research Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep LearningResearch Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep Learning
 
Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep Learning
 
AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019
 
Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.
 

Recently uploaded

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 

Recently uploaded (20)

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 

Data Management Overview

  • 1. Full Stack Deep Learning Josh Tobin, Sergey Karayev, Pieter Abbeel Data Management
  • 2. Full Stack Deep Learning “All-in-one” Deployment or or Development Training/Evaluation or Data or Experiment Management Labeling Interchange Storage Monitoring ? Hardware / Mobile Software Engineering Frameworks Database Hyperparameter TuningDistributed Training Web Versioning Processing CI / Testing ? Resource Management Data Management - overview
  • 3. Full Stack Deep Learning 3 https://veekaybee.github.io/2019/02/13/data-science-is-different/ Data Management - overview
  • 4. Full Stack Deep Learning In this module 4 1. Sources
 Where training data comes from
 2. Labeling
 How to label proprietary data at scale
 3. Storage Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately 4. Versioning Dataset is updated through user activity or additional labeling, affects model 5. Processing Aggregating and converting raw data and metadata Data Management - overview
  • 5. Full Stack Deep Learning In this module 5 1. Sources
 Where training data comes from
 2. Labeling
 How to label proprietary data at scale
 3. Storage Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately 4. Versioning Dataset is updated through user activity or additional labeling, affects model 5. Processing Aggregating and converting raw data and metadata Data Management - overview
  • 6. Full Stack Deep Learning • Most DL applications require lots of labeled data • Exceptions: RL, GANs, "semi-supervised" learning • Publicly available datasets = No competitive advantage • But can serve as starting point 6 Sources Data Management - sources
  • 7. Full Stack Deep Learning Usually: spend $$$ and time to label own data 7 https://cdn-sv1.deepsense.ai/wp-content/uploads/2017/04/sample_image_from_the_training_set.jpg Data Management - sources
  • 8. Full Stack Deep Learning 8Data Management - sources Data flywheel Enables rapid improvement with user labels
  • 9. Full Stack Deep Learning 9 Data Management - sources https://lilianweng.github.io/lil-log/2019/11/10/self-supervised-learning.html Semi-supervised learning Fig. 4. Illustration of self-supervised learning by predicting the relative position of two random patches. (Image source: Doersch et al., 2015) Fig. 1. A great summary of how self-supervised learning tasks can be constructed (Image source: LeCun’s talk) Use parts of data to label other parts
  • 10. Full Stack Deep Learning Image data augmentation • Must do for training vision models • Keras, fast.ai provide functions that do this • Hook into your Dataset class to do in CPU workers in parallel to GPU training 10 https://towardsdatascience.com/1000x-faster-data-augmentation-b91bafee896c Data Management - sources
  • 11. Full Stack Deep Learning Other data augmentation • Tabular • Delete some cells to simulate missing data • Text • No well established techniques, but replace words with synonyms, change order of things. • Speech/video • Change speed, inserts pauses, etc 11 https://github.com/makcedward/nlpaug Data Management - sources
  • 12. Full Stack Deep Learning 12 https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning/ Data Management - sources Synthetic data Underrated idea that is almost always worth starting with
  • 13. Full Stack Deep Learning 13 This can get pretty deep! Andrew Moffat - https://github.com/amoffat/metabrite-receipt-tests Data Management - sources
  • 14. Full Stack Deep Learning 14 https://microsoft.github.io/AirSim/ Data Management - sources https://openai.com/blog/ingredients-for-robotics-research/ Especially for driving and robotics
  • 15. Full Stack Deep Learning Questions? 15
  • 16. Full Stack Deep Learning In this module 16 1. Sources
 Where training data comes from
 2. Labeling
 How to label proprietary data at scale
 3. Storage Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately 4. Versioning Dataset is updated through user activity or additional labeling, affects model 5. Processing Aggregating and converting raw data and metadata Data Management - overview
  • 17. Full Stack Deep Learning Data Labeling 1. User Interfaces 2. Sources of labor 3. Service companies 17Data Management - labeling
  • 18. Full Stack Deep Learning 18 Standard set of features: - bounding boxes, segmentations, keypoints, cuboids - set of applicable classes Data Management - labeling
  • 19. Full Stack Deep Learning 19 Training the annotators is crucial Quality assurance is key Data Management - labeling
  • 20. Full Stack Deep Learning 20 • Hire own annotators, promote best ones to quality control • Pros: secure, fast (once hired), less QC needed • Cons: expensive, slow to scale, admin overhead • ...or, crowdsource (Mechanical Turk) • Pros: cheaper, more scalable • Cons: not secure, significant QC effort required • ...or, full-service data labeling companies Sources of Labor Data Management - labeling
  • 21. Full Stack Deep Learning Service Companies 21 • Data labeling requires separate software stack, temporary labor, and quality assurance. Makes sense to outsource. • Dedicate several days to selecting the best one for you: • Label gold standard data yourself • Sales calls with several contenders, ask for work sample on same data • Ensure agreement with your gold standard, and evaluate on value Data Management - labeling
  • 22. Full Stack Deep Learning 22 FigureEight is the original AI data labeling company Data Management - labeling
  • 23. Full Stack Deep Learning 23 Scale.ai is a dominant up-and-comer Data Management - labeling
  • 24. Full Stack Deep Learning 24 And there are a ton of others Data Management - labeling
  • 25. Full Stack Deep Learning 25 And there are a ton of others Data Management - labeling
  • 26. Full Stack Deep Learning Software 26 • Full-service data labeling is always pricy • But some companies offer their software without labor Data Management - labeling
  • 27. Full Stack Deep Learning 27 Prodigy Data Management - labeling
  • 28. Full Stack Deep Learning 28 Prodigy Data Management - labeling
  • 29. Full Stack Deep Learning 29 • Conclusions • outsource to full-service company if you can afford it • if not, then at least use existing software • hiring part-time makes more sense than trying to make crowdsourcing work Data Management - labeling
  • 30. Full Stack Deep Learning Questions? 30
  • 31. Full Stack Deep Learning In this module 31 1. Sources
 Where training data comes from
 2. Labeling
 How to label proprietary data at scale
 3. Storage Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately 4. Versioning Dataset is updated through user activity or additional labeling, affects model 5. Processing Aggregating and converting raw data and metadata Data Management - overview
  • 32. Full Stack Deep Learning Data Storage 1. Building blocks - Filesystem - Object Storage - Database - "Data Lake" 2. What goes where 3. Where to learn more 32Data Management - storage
  • 33. Full Stack Deep Learning Filesystem • Foundational layer of storage. • Fundamental unit is a "file", which can be text or binary, is not versioned, and is easily overwritten. • Can be as simple as a locally mounted disk containing all the files you need. • Can be networked (e.g. NFS): accessible over network by multiple machines. • Can be distributed (e.g. HDFS): stored and accessed over multiple machines. • Be mindful of access patterns! Can be fast, but not parallel. 33Data Management - storage
  • 34. Full Stack Deep Learning Object Storage • An API over the filesystem. GET, PUT, DELETE files to a service, without worrying where they are actually stored. • Fundamental unit is an "object". Usually binary: image, sound file, etc. • Versioning, redundancy can be built into the service. • Can be parallel, but not fast. • Amazon S3 is the canonical example. • For on-prem setup, Ceph is a good solution. 34Data Management - storage
  • 35. Full Stack Deep Learning Database • Persistent, fast, scalable storage and retrieval of structured data. • Mental model: everything is actually in RAM, but software ensures that everything is logged to disk and never lost. • Fundamental unit is a "row": unique id, references to other rows, values in columns. • Not for binary data! Store references instead. • Usually for data that will be accessed again (not logs). • Postgres is the right choice >90% of the time. Best-in-class SQL, great support for unstructured JSON, actively developed. 35Data Management - storage
  • 36. Full Stack Deep Learning SQL • SQL is the right interface for structured data. • If you avoid using it, you will eventually reinvent a slow and crappy subset of it. 36Data Management - storage
  • 37. Full Stack Deep Learning "Data Lake" • Unstructured aggregation of data from multiple sources, e.g. databases, logs, expensive data transformations. • "Schema on read": dump everything in, then transform for specific needs later. 37 https://medium.com/data-ops/throw-your-data-in-a-lake-32cd21b6de02 Data Management - storage
  • 38. Full Stack Deep Learning What goes where • Binary data (images, sound files, compressed texts) is stored as objects. • Metadata (labels, user activity) is stored in database. • If need features which are not obtainable from database (logs), set up data lake and a process to aggregate needed data. • At training time, copy the data that is needed onto filesystem (local or networked). 38Data Management - storage
  • 39. Full Stack Deep Learning Where to learn more 39 https://dataintensive.net Data Management - storage
  • 40. Full Stack Deep Learning Questions? 40
  • 41. Full Stack Deep Learning In this module 41 1. Sources
 Where training data comes from
 2. Labeling
 How to label proprietary data at scale
 3. Storage Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately 4. Versioning Dataset is updated through user activity or additional labeling, affects model 5. Processing Aggregating and converting raw data and metadata Data Management - overview
  • 42. Full Stack Deep Learning Data Versioning Level 0: unversioned Level 1: versioned via snapshot at training time Level 2: versioned as a mix of assets and code Level 3: specialized data versioning solution 42Data Management - versioning
  • 43. Full Stack Deep Learning Level 0 • Data lives are on filesystem/S3 and database • Problem: Deployments must be versioned. Deployed machine learning models are part code, part data. If data is not versioned, deployed models are not versioned. • Problem you will face: inability to get back to a previous level of performance 43Data Management - versioning
  • 44. Full Stack Deep Learning Level 1 • Data is versioned by storing a snapshot of everything at training time • This allows you to version deployed models, and to get back to past performance, but is super hacky. • Would be far better to be able to version data just as easily as code. 44Data Management - versioning
  • 45. Full Stack Deep Learning Level 2 • Data is versioned as a mix of assets and code. • Heavy files stored in S3, with unique ids. Training data is stored as JSON or similar, referring to these ids and include relevant metadata (labels, user activity, etc). • JSON files can get big, but using git-lfs lets us store them just as easily as code • Can improve further with "lazydata": only syncing files that are needed. • The git signature + of the raw data file defines the version of the dataset • Often helpful to add timestamp 45Data Management - versioning
  • 46. Full Stack Deep Learning Level 3 • Specialized solutions for versioning data. • Avoid these until you can fully explain how they will improve your project. • Leading solutions are DVC, Pachyderm, Quill. 46Data Management - versioning
  • 47. Full Stack Deep Learning DVC 47 1 2 3 4 Data Management - versioning
  • 48. Full Stack Deep Learning Pachyderm 48Data Management - versioning
  • 49. Full Stack Deep Learning Dolt 49 A nice simple solution for versioning databases, that speaks SQL. Data Management - versioning
  • 50. Full Stack Deep Learning Dolt 50Data Management - versioning
  • 51. Full Stack Deep Learning Questions? 51
  • 52. Full Stack Deep Learning In this module 52 1. Sources
 Where training data comes from
 2. Labeling
 How to label proprietary data at scale
 3. Storage Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately 4. Versioning Dataset is updated through user activity or additional labeling, affects model 5. Processing Aggregating and converting raw data and metadata Data Management - overview
  • 53. Full Stack Deep Learning Motivational Example •We have to train a photo popularity predictor every night. • For each photo, training data must include: • Metadata such as posting time, title, location • Some features of the user, such as how many times they logged in today. • Outputs of photo classifiers (content, style) 53 In database Need to compute from logs Need to run classifiers Data Management - processing
  • 54. Full Stack Deep Learning 54 Task Dependencies • Some tasks can't be started until other tasks are finished. • Finishing a task should "kick off" its dependencies Data Management - processing
  • 55. Full Stack Deep Learning • Run everything from scratch. If things exist, no need to recompute. 55 https://www.slideshare.net/PyData/how-i-learned-to-time-travel-or-data-pipelining-and-scheduling-with-airflow-67650418 Data Workflows Data Management - processing
  • 56. Full Stack Deep Learning Makefile limitations • What if re-computation needs to depend on content, not on date? • What if the dependencies are not files, but disparate programs and databases? • What if the work needs to be spread over multiple machines? • What if there are 100s of "Makefiles" all executing at the same time, with shared dependencies? 56Data Management - processing
  • 57. Full Stack Deep Learning Airflow 57 https://www.slideshare.net/PyData/how-i-learned-to-time-travel-or-data-pipelining-and-scheduling-with-airflow-67650418 Data Management - processing
  • 58. Full Stack Deep Learning Distributing work • The workflow manager has a queue for the tasks, and manages workers that pull from it, restarting jobs if they fail. 58 http://site.clairvoyantsoft.com/making-apache-airflow-highly-available/ Data Management - processing
  • 59. Full Stack Deep Learning Try to keep things simple • Don't overengineer • e.g. unix: powerful parallelism, streaming, highly optimized 59 https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html 26 minutes 70 seconds 18 seconds in parallel in parallel Data Management - processing
  • 60. Full Stack Deep Learning Questions? 60
  • 61. Full Stack Deep Learning Thank you! 61