A Guide to Data Versioning
with MapR Snapshots
IAN DOWNARD
idownard@mapr.com
© 2019 MapR Technologies 3
Machine learning involves lots of trial and
error.
• Experimentation is key.
• Models, configs, and data must be version controlled.
Recurrent Neural Networks (RNNs)
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
A hidden layer
A visible layer
There is no rule of thumb about how many layers you should use.
*LSTM is a special kind of RNN.
Experimentation in Machine Learning
Feature selection: Which input signals do we want the model to generalize?
Input: How far to look back into the time series?
Input features: Which features to include?
Model structure: How many layers to stack?
Hidden layers: How many units to chain?
Dropout: How often to force “forgetfulness”?
[Figure: Input → Neural Net 1 (100 layers, 0.2 dropout) and Neural Net 2 (50 layers, 0.3 dropout) → Prediction]
How do you version control ML experiments?
• It’s easy to version control source
code and model configurations.
• But how do you version control
DATA?
How do you version control DATA?
• You can copy it, but don’t.
– Copies are slow.
– Copies take up too much space, and the data may easily not fit on a single machine.
– Copies can't be checked into version control (e.g. GitHub).
• Snapshots are better.
– Snapshots are fast.
– Snapshots don't take up any space (initially).
– Snapshot IDs can be checked into version control (e.g. GitHub).
– Snapshots preserve file ownership properties.
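One way to act on the point about checking snapshot IDs into version control is to commit a small text record that names the snapshot used by each run, rather than the data itself. The sketch below is a minimal illustration in Python. The manifest layout, the function name `experiment_manifest`, and the experiment name `lstm_baseline` are assumptions made for this example and are not part of MapR.

```python
import json


def experiment_manifest(experiment, volume, snapshot_name, params):
    """Return a small JSON record tying one training run to its data snapshot."""
    record = {
        "experiment": experiment,  # hypothetical run label
        "data": {"volume": volume, "snapshot": snapshot_name},
        "params": params,
    }
    # sort_keys keeps the output stable, so diffs only show real changes.
    return json.dumps(record, indent=2, sort_keys=True)


manifest = experiment_manifest(
    "lstm_baseline",
    "my_volume",
    "My_Experiment_01",
    {"layers": 50, "dropout": 0.3},
)
print(manifest)
```

The printed text could be saved next to the model configuration and committed with it, so a later checkout of the code also says which snapshot the model was trained on.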
• Immutable (read only)
• Store only incremental
changes needed to roll
back to a point in time.
https://mapr.com/resources/mapr-snapshots/
“You can take a
snapshot of a 1
petabyte cluster in
seconds with no
additional data
storage required.”
https://mapr.com/resources/mapr-snapshots/
Snapshot Implementation
[Diagram: File Copy versus Incremental Updates, shown for File F1 made of Block A, Block B, and Block C]
Snapshot Implementation
For a file copy, processing complexity is linear in the file size (number of blocks).
Storage complexity is also linear.
[Diagram: File F1 (Block A, Block B, Block C) duplicated as File Copy F1’ (Block A, Block B, Block C)]
Snapshot Implementation
Snapshots point to existing storage blocks; they don’t copy data.
[Diagram: File F1 with Block A, Block B, and Block C; Snapshot S1]
Snapshot Implementation
File changes are written to new storage blocks.
[Diagram: File F1 with Block A, Block B, Block C, and a new Block C’; Snapshot S1]
Snapshot Implementation
Snapshots capture incremental file changes.
[Diagram: File F1 with Block A, Block B, Block C, and Block C’; Snapshot S1 and Snapshot S2]
Snapshot Implementation
• Storage complexity is based on the number of changed blocks.
• It takes about a second to snapshot a small volume, and several seconds to snapshot a large one.
– Snapshot processing time: O(log n)
– File copy processing time: O(n), where n is the number of blocks in the volume.
[Diagram: File F1 with Block A, Block B, Block C, and Block C’; Snapshot S1 and Snapshot S2]
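The block-sharing idea in the Snapshot Implementation slides can be sketched as a toy model. This is an illustrative Python simplification, not MapR's implementation: the class `Volume` and all of its methods are hypothetical. The toy also copies one pointer per block when it takes a snapshot, so it does not reproduce the O(log n) figure quoted on the slide, which would presumably rely on an index structure that is not modeled here. What it does show is that a snapshot stores no block contents and that a change adds only the changed block.

```python
class Volume:
    """Toy copy-on-write volume: a file is a list of block IDs into a shared store."""

    def __init__(self):
        self.store = {}      # block ID -> bytes; blocks are never modified in place
        self.files = {}      # file name -> list of block IDs (the live view)
        self.snapshots = {}  # snapshot name -> frozen block-ID lists per file
        self._next_id = 0

    def _alloc(self, data):
        block_id = self._next_id
        self._next_id += 1
        self.store[block_id] = data
        return block_id

    def write_file(self, name, blocks):
        self.files[name] = [self._alloc(b) for b in blocks]

    def update_block(self, name, index, data):
        # A change goes to a new block; the old block stays for snapshots that use it.
        self.files[name][index] = self._alloc(data)

    def snapshot(self, snap_name):
        # Records pointers only, never block contents.
        self.snapshots[snap_name] = {f: tuple(ids) for f, ids in self.files.items()}

    def read(self, name, snap_name=None):
        ids = self.snapshots[snap_name][name] if snap_name else self.files[name]
        return [self.store[i] for i in ids]


vol = Volume()
vol.write_file("F1", [b"A", b"B", b"C"])
vol.snapshot("S1")
blocks_before = len(vol.store)   # three blocks; the snapshot added none
vol.update_block("F1", 2, b"C'")
vol.snapshot("S2")
print(vol.read("F1", "S1"), vol.read("F1", "S2"), len(vol.store))
```

Reading through S1 still returns the original third block, while S2 and the live file see the new one, and the store grows by exactly one block for the one change.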
Creating Snapshots
Command Line:
maprcli volume snapshot create -cluster gcloud.mapr.com \
  -snapshotname My_Experiment_01 -volume my_volume
REST API:
curl -k -X POST 'https://gcloudnodea:8443/rest/volume/snapshot/create?volume=my_volume&snapshotname=My_Experiment_01' --user mapr:mapr
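When snapshots are created from a training script, the REST URL can be assembled programmatically so that volume and snapshot names are encoded correctly. The following is a minimal Python sketch that only builds the URL and sends nothing. The endpoint path, port, and query parameter names are taken from the curl example on this slide; the function name `snapshot_create_url` is an assumption for illustration.

```python
from urllib.parse import urlencode


def snapshot_create_url(host, volume, snapshot_name, port=8443):
    """Build the snapshot-create URL in the form shown in the curl example."""
    # urlencode escapes characters that are not safe in a query string.
    query = urlencode({"volume": volume, "snapshotname": snapshot_name})
    return f"https://{host}:{port}/rest/volume/snapshot/create?{query}"


url = snapshot_create_url("gcloudnodea", "my_volume", "My_Experiment_01")
print(url)
```

The resulting string could be passed to curl or to any HTTP client as a POST request, with credentials supplied as in the command above.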
MapR Snapshots are not just for files!
• Snapshots include MapR-DB tables
• Snapshots include streams and consumer cursors
mapr copytable -src /my_vol/.snapshot/snaptest/my_table \
  -dst /my_vol/my_table -mapreduce false
mapr copystream -src /my_vol/.snapshot/snaptest/my_stream \
  -dst /my_vol/my_stream -mapreduce false
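Both commands read from a path of the form `<volume mount>/.snapshot/<snapshot name>/<object>`. A small helper can make that convention explicit when restoring several objects. This is a sketch in Python under that assumption: the helper names are hypothetical, and the argument list simply mirrors the `mapr copytable` command shown on this slide without running it.

```python
import posixpath


def snapshot_path(volume_mount, snapshot_name, relative_path):
    """Path of an object as it appears inside a named snapshot of a volume."""
    return posixpath.join(volume_mount, ".snapshot", snapshot_name, relative_path)


def copytable_args(volume_mount, snapshot_name, table):
    """Argument list that mirrors the copytable command shown on the slide."""
    src = snapshot_path(volume_mount, snapshot_name, table)
    dst = posixpath.join(volume_mount, table)
    return ["mapr", "copytable", "-src", src, "-dst", dst, "-mapreduce", "false"]


args = copytable_args("/my_vol", "snaptest", "my_table")
print(" ".join(args))
```

A list of arguments like this could be handed to `subprocess.run` on a cluster node where the `mapr` command is installed.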
Snapshots are useful for A/B testing SQL queries.
“ML orchestration that keeps track of
all your experiments so you can always
answer the question of how a model
was trained.”
Valohai uses MapR
Snapshots to version
control data for
machine learning
automation
https://youtu.be/dPVMBVZ--Dw
References
https://mapr.com/resources/mapr-snapshots/
http://valohai.com

More Related Content

What's hot

Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013mumrah
 
Data ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFiData ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFiLev Brailovskiy
 
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking VN
 
Apache NiFi User Guide
Apache NiFi User GuideApache NiFi User Guide
Apache NiFi User GuideDeon Huang
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewenconfluent
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.Taras Matyashovsky
 
PostgreSQL and CockroachDB SQL
PostgreSQL and CockroachDB SQLPostgreSQL and CockroachDB SQL
PostgreSQL and CockroachDB SQLCockroachDB
 
Open stack
Open stackOpen stack
Open stacksvm
 
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFi
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFiReal-time Twitter Sentiment Analysis and Image Recognition with Apache NiFi
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFiTimothy Spann
 
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and ParquetBig Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and ParquetDataWorks Summit
 
Power of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data StructuresPower of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data Structuresconfluent
 
Chapter1 computer introduction note
Chapter1  computer introduction note Chapter1  computer introduction note
Chapter1 computer introduction note arvind pandey
 

What's hot (20)

Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
 
Data ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFiData ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFi
 
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
 
Apache NiFi Crash Course Intro
Apache NiFi Crash Course IntroApache NiFi Crash Course Intro
Apache NiFi Crash Course Intro
 
IoT:what about data storage?
IoT:what about data storage?IoT:what about data storage?
IoT:what about data storage?
 
Apache NiFi User Guide
Apache NiFi User GuideApache NiFi User Guide
Apache NiFi User Guide
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
 
transport protocols
transport protocolstransport protocols
transport protocols
 
IoT Coap
IoT Coap IoT Coap
IoT Coap
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.
 
TCP/IP and UDP protocols
TCP/IP and UDP protocolsTCP/IP and UDP protocols
TCP/IP and UDP protocols
 
Network Layer
Network LayerNetwork Layer
Network Layer
 
IPFS: The Permanent Web
IPFS: The Permanent WebIPFS: The Permanent Web
IPFS: The Permanent Web
 
PostgreSQL and CockroachDB SQL
PostgreSQL and CockroachDB SQLPostgreSQL and CockroachDB SQL
PostgreSQL and CockroachDB SQL
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Open stack
Open stackOpen stack
Open stack
 
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFi
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFiReal-time Twitter Sentiment Analysis and Image Recognition with Apache NiFi
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFi
 
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and ParquetBig Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
 
Power of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data StructuresPower of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data Structures
 
Chapter1 computer introduction note
Chapter1  computer introduction note Chapter1  computer introduction note
Chapter1 computer introduction note
 

Similar to A Guide to Data Versioning with MapR Snapshots

Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterTim Ellison
 
ML At the Edge: Building Your Production Pipeline With Apache Spark and Tens...
ML At the Edge:  Building Your Production Pipeline With Apache Spark and Tens...ML At the Edge:  Building Your Production Pipeline With Apache Spark and Tens...
ML At the Edge: Building Your Production Pipeline With Apache Spark and Tens...Stavros Kontopoulos
 
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
 ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens... ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...Databricks
 
SYN207: Newest and coolest NetScaler features you should be jazzed about
SYN207: Newest and coolest NetScaler features you should be jazzed aboutSYN207: Newest and coolest NetScaler features you should be jazzed about
SYN207: Newest and coolest NetScaler features you should be jazzed aboutCitrix
 
AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...Ryousei Takano
 
20200113 - IBM Cloud Côte d'Azur - DeepDive Kubernetes
20200113 - IBM Cloud Côte d'Azur - DeepDive Kubernetes20200113 - IBM Cloud Côte d'Azur - DeepDive Kubernetes
20200113 - IBM Cloud Côte d'Azur - DeepDive KubernetesIBM France Lab
 
Cw13 journy to the cloud by mohamed el mofty
Cw13 journy to the cloud by mohamed el moftyCw13 journy to the cloud by mohamed el mofty
Cw13 journy to the cloud by mohamed el moftyTheInevitableCloud
 
Sol linux cmg-t_1_1.pptx
Sol linux cmg-t_1_1.pptxSol linux cmg-t_1_1.pptx
Sol linux cmg-t_1_1.pptxBob Sneed
 
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Intel® Software
 
Meetup 2020 - Back to the Basics part 101 : IaC
Meetup 2020 - Back to the Basics part 101 : IaCMeetup 2020 - Back to the Basics part 101 : IaC
Meetup 2020 - Back to the Basics part 101 : IaCDamienCarpy
 
AWS Partner Presentation - Accenture Digital Supply Chain In The Cloud
AWS Partner Presentation - Accenture Digital Supply Chain In The CloudAWS Partner Presentation - Accenture Digital Supply Chain In The Cloud
AWS Partner Presentation - Accenture Digital Supply Chain In The CloudAmazon Web Services
 
From data centers to fog computing: the evaporating cloud
From data centers to fog computing: the evaporating cloudFrom data centers to fog computing: the evaporating cloud
From data centers to fog computing: the evaporating cloudFogGuru MSCA Project
 
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...StampedeCon
 
Apache Big Data Europe 2016
Apache Big Data Europe 2016Apache Big Data Europe 2016
Apache Big Data Europe 2016Tim Ellison
 
Cloud Programming Simplified: A Berkeley View on Serverless Computing
Cloud Programming Simplified: A Berkeley View on Serverless ComputingCloud Programming Simplified: A Berkeley View on Serverless Computing
Cloud Programming Simplified: A Berkeley View on Serverless Computingmustafa sarac
 
Creating a Machine Learning Model on the Cloud
Creating a Machine Learning Model on the CloudCreating a Machine Learning Model on the Cloud
Creating a Machine Learning Model on the CloudAlexander Al Basosi
 
"Tools and Techniques for Optimizing DNNs on Arm-based Processors with Au-Zon...
"Tools and Techniques for Optimizing DNNs on Arm-based Processors with Au-Zon..."Tools and Techniques for Optimizing DNNs on Arm-based Processors with Au-Zon...
"Tools and Techniques for Optimizing DNNs on Arm-based Processors with Au-Zon...Edge AI and Vision Alliance
 
Comparison of various streaming technologies
Comparison of various streaming technologiesComparison of various streaming technologies
Comparison of various streaming technologiesSachin Aggarwal
 

Similar to A Guide to Data Versioning with MapR Snapshots (20)

Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark faster
 
ML At the Edge: Building Your Production Pipeline With Apache Spark and Tens...
ML At the Edge:  Building Your Production Pipeline With Apache Spark and Tens...ML At the Edge:  Building Your Production Pipeline With Apache Spark and Tens...
ML At the Edge: Building Your Production Pipeline With Apache Spark and Tens...
 
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
 ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens... ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
 
SYN207: Newest and coolest NetScaler features you should be jazzed about
SYN207: Newest and coolest NetScaler features you should be jazzed aboutSYN207: Newest and coolest NetScaler features you should be jazzed about
SYN207: Newest and coolest NetScaler features you should be jazzed about
 
AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...
 
An Optics Life
An Optics LifeAn Optics Life
An Optics Life
 
20200113 - IBM Cloud Côte d'Azur - DeepDive Kubernetes
20200113 - IBM Cloud Côte d'Azur - DeepDive Kubernetes20200113 - IBM Cloud Côte d'Azur - DeepDive Kubernetes
20200113 - IBM Cloud Côte d'Azur - DeepDive Kubernetes
 
Cw13 journy to the cloud by mohamed el mofty
Cw13 journy to the cloud by mohamed el moftyCw13 journy to the cloud by mohamed el mofty
Cw13 journy to the cloud by mohamed el mofty
 
Sol linux cmg-t_1_1.pptx
Sol linux cmg-t_1_1.pptxSol linux cmg-t_1_1.pptx
Sol linux cmg-t_1_1.pptx
 
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
 
Meetup 2020 - Back to the Basics part 101 : IaC
Meetup 2020 - Back to the Basics part 101 : IaCMeetup 2020 - Back to the Basics part 101 : IaC
Meetup 2020 - Back to the Basics part 101 : IaC
 
AWS Partner Presentation - Accenture Digital Supply Chain In The Cloud
AWS Partner Presentation - Accenture Digital Supply Chain In The CloudAWS Partner Presentation - Accenture Digital Supply Chain In The Cloud
AWS Partner Presentation - Accenture Digital Supply Chain In The Cloud
 
From data centers to fog computing: the evaporating cloud
From data centers to fog computing: the evaporating cloudFrom data centers to fog computing: the evaporating cloud
From data centers to fog computing: the evaporating cloud
 
Clean sw 3_architecture
Clean sw 3_architectureClean sw 3_architecture
Clean sw 3_architecture
 
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...
 
Apache Big Data Europe 2016
Apache Big Data Europe 2016Apache Big Data Europe 2016
Apache Big Data Europe 2016
 
Cloud Programming Simplified: A Berkeley View on Serverless Computing
Cloud Programming Simplified: A Berkeley View on Serverless ComputingCloud Programming Simplified: A Berkeley View on Serverless Computing
Cloud Programming Simplified: A Berkeley View on Serverless Computing
 
Creating a Machine Learning Model on the Cloud
Creating a Machine Learning Model on the CloudCreating a Machine Learning Model on the Cloud
Creating a Machine Learning Model on the Cloud
 
"Tools and Techniques for Optimizing DNNs on Arm-based Processors with Au-Zon...
"Tools and Techniques for Optimizing DNNs on Arm-based Processors with Au-Zon..."Tools and Techniques for Optimizing DNNs on Arm-based Processors with Au-Zon...
"Tools and Techniques for Optimizing DNNs on Arm-based Processors with Au-Zon...
 
Comparison of various streaming technologies
Comparison of various streaming technologiesComparison of various streaming technologies
Comparison of various streaming technologies
 

Recently uploaded

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 

Recently uploaded (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 

A Guide to Data Versioning with MapR Snapshots

  • 1. A Guide to Data Versioning with MapR Snapshots IAN DOWNARD idownard@mapr.com
  • 2. © 2019 MapR Technologies 3 Machine learning involves lots of trial and error. • Experimentation is key. • Models, configs, and data must be version controlled.
  • 3. © 2019 MapR Technologies 4 Recurrent Neural Networks (RNNs) http://colah.github.io/posts/2015-08-Understanding-LSTMs/ A hidden layer A visible layer There is no rule of thumb about how many layers you should use. *LSTM is a special kind of RNN.
  • 4. © 2019 MapR Technologies 5 Feature Selection: Which input signals do we want the model to generalize? Input: How far look back into time-series? Input Features: Which features to include? Model structure: How many layers to stack? Hidden layers: How many units to chain? Dropout: How often to force “forgetfulness”? Experimentation in Machine Learning Neural Net 1 Neural Net 2 100 Layers 0.2 Dropout 50 Layers 0.3 Dropout Prediction Input
  • 5. © 2019 MapR Technologies 6 How do you version control ML experiments? • It’s easy to version control source code and model configurations. • But how do you version control DATA?
  • 6. © 2019 MapR Technologies 7 How do you version control DATA? • You can copy it, but don’t. – copies are slow. – copies take up too much space. Easily wouldn't fit on a single machine. – copies can't be checked into version control (e.g. Github) • Snapshots are better. – Snapshots are fast – Snapshots don't take up any space (initially) – Snapshots IDs can be checked into version control (e.g. Github) – Snapshots preserve file ownership properties.
  • 7.
  • 8. © 2019 MapR Technologies 9 • Immutable (read only) • Store only incremental changes needed to roll back to a point in time. https://mapr.com/resources/mapr-snapshots/
  • 9. © 2019 MapR Technologies 10 “You can take a snapshot of a 1 petabyte cluster in seconds with no additional data storage required.” https://mapr.com/resources/mapr-snapshots/
  • 10. © 2019 MapR Technologies 11 Snapshot Implementation File Copy Incremental Updates File F1 Block A Block B Block C
  • 11. © 2019 MapR Technologies 12 Snapshot Implementation Processing complexity is linear, based on file size (number of blocks). Storage Complexity is also linear. File F1 Block A Block B Block C File Copy F1’ Block A Block B Block C
  • 12. © 2019 MapR Technologies 13 Snapshot Implementation Snapshots point to existing storage blocks. i.e. They don’t copy data. File F1 Block A Block B Block CSnapshot S1
  • 13. © 2019 MapR Technologies 14 Snapshot Implementation File F1 Block A Block B Block CSnapshot S1 Block C’ Files changes write to new storage blocks.
  • 14. © 2019 MapR Technologies 15 Snapshot Implementation File F1 Block A Block B Block CSnapshot S1 Block C’ Snapshot S2 Snapshots capture incremental file changes.
  • 15. © 2019 MapR Technologies 16 • Storage Complexity is based on the number of changed blocks. • It takes about a second to snapshot a small volume, and several seconds to snapshot a large one. – Snapshot processing speed: O(log n) – File copy processing speed: O(n), where n is the number of blocks in the volume. Snapshot Implementation File F1 Block A Block B Block CSnapshot S1 Block C’ Snapshot S2
  • 16. © 2019 MapR Technologies 17 Creating Snapshots
  • 17. © 2019 MapR Technologies 18 Creating Snapshots Command Line: REST API: maprcli volume snapshot create -cluster gcloud.mapr.com -snapshotname My_Experiment_01 -volume my_volume curl -k -X POST 'https://gcloudnodea:8443/rest/volume/snapshot/create? volume=my_volume&snapshotname=My_Experiment_01' --user mapr:mapr
  • 18. © 2019 MapR Technologies 19 MapR Snapshots are not just for files! • Snapshots include MapR-DB tables • Snapshots include streams and consumer cursors mapr copytable -src /my_vol/.snapshot/snaptest/my_table -dst /my_vol/my_table -mapreduce false mapr copystream -src /my_vol/.snapshot/snaptest/my_stream -dst / my_vol/my_stream -mapreduce false
  • 19. © 2019 MapR Technologies 20 Snapshots are useful to A/B test SQL queries.
  • 20. “ML orchestration that keeps track of all your experiments so you can always answer the question of how a model was trained.”
  • 21. Valohai uses MapR Snapshots to version control data for machine learning automation  https://youtu.be/dPVMBVZ--Dw
  • 22. © 2019 MapR Technologies 23 References https://mapr.com/resources/mapr-snapshots/ http://valohai.com
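
The Snapshot Implementation slides (10 to 15) can be illustrated with a toy model. This is only a sketch under stated assumptions, not MapR's implementation: the `Volume` class, its method names, and the in-memory block store are all hypothetical. It shows the idea from the slides that a snapshot records pointers to existing blocks and that a later change is written to a new block. Note that this toy copies one pointer per block when it snapshots, so it does not reproduce the O(log n) snapshot cost quoted on slide 15.

```python
class Volume:
    """Toy model (hypothetical): a file is a list of block IDs,
    and a snapshot is a frozen copy of those block-ID lists."""

    def __init__(self):
        self.blocks = {}     # block ID -> bytes (the block store)
        self.files = {}      # file name -> list of block IDs
        self.snapshots = {}  # snapshot name -> {file name -> tuple of block IDs}
        self._next_id = 0

    def _store(self, data):
        # Allocate a new block and return its ID.
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        return block_id

    def write_file(self, name, chunks):
        self.files[name] = [self._store(chunk) for chunk in chunks]

    def snapshot(self, snap_name):
        # Only block pointers are copied; block contents are never duplicated.
        self.snapshots[snap_name] = {
            name: tuple(ids) for name, ids in self.files.items()
        }

    def update_block(self, name, index, data):
        # A change goes to a new block; the old block stays for older snapshots.
        self.files[name][index] = self._store(data)

    def read(self, name, snap_name=None):
        ids = self.snapshots[snap_name][name] if snap_name else self.files[name]
        return b"".join(self.blocks[i] for i in ids)


# Replay the slide sequence: F1 = A, B, C; take S1; change C to C'; take S2.
vol = Volume()
vol.write_file("F1", [b"A", b"B", b"C"])
vol.snapshot("S1")
vol.update_block("F1", 2, b"C'")
vol.snapshot("S2")
```

After this sequence the block store holds four blocks (A, B, C, C’) rather than the nine that three full copies would need, and reading F1 through S1 still returns the original contents.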

Editor's Notes

  1. ML works like this: you make assumptions about the data, then you try a range of experiments. Officially we call these “parameterized studies”. Unofficially we call it “trial and error”. Trial and error leads to lots of versions, so version control is important. For reproducibility, it’s critical to keep models, model configs, and training data together in version control. It’s also important for the iterative model development process to be as frictionless as possible, or productivity will suffer greatly.
  2. There is no rule of thumb for the number of hidden nodes you should use. It is something you have to figure out through trial and error.
  3. Dropout forces better generalization. We must specify a loss function and an optimizer when compiling the model. The loss function is a way of penalizing the model for inaccurate predictions. We use binary cross entropy because we have just two classes (1 and 0). The optimizer defines how to adjust neuron weights in response to inaccurate predictions. The Adam optimizer makes sense, because I’ve read that Adam learns fast, is stable over a wide range of learning rates, and has comparatively low memory requirements. Keras uses a default learning rate of 0.001.
  4. Here’s how file storage works. Several blocks
  10. ########################################################################
      # SNAPSHOT DEMO
      # PRELIM:
      #   sudo mount -o hard,nolock localhost:/mapr /mapr
      #   ls /mapr
      # RUN:
      #   doitlive play snapshot_demo.sh --commentecho
      ########################################################################
      #Create a volume
      maprcli volume create -name my_volume mount -path /my_volume
      #Create a 1GB file
      cp yelp_academic_dataset_business.json /mapr/gcloud.cluster.com/my_volume/my_file.json
      #Create a MapR-DB JSON table
      mapr importJSON -idField business_id -src /my_volume/my_file.json -dst /my_volume/my_table -mapreduce false
      #Create a MapR Event Store stream for Apache Kafka
      maprcli stream create -path /my_volume/my_stream -produceperm p -consumeperm p -topicperm p
      #Write some data to the stream
      printf "`seq 1 5`" | /opt/mapr/kafka/kafka-0.9.0/bin/kafka-console-producer.sh --topic /my_volume/my_stream:my_topic --broker-list this.will.be.ignored:9092
      #Consume some data from the stream
      python ~/consumer.py /my_volume/my_stream:my_topic
      #Observe that the consumer cursor has read all stream messages
      maprcli stream cursor list -path /my_volume/my_stream
      #Write a couple more messages to the stream
      printf "`seq 6 10`" | /opt/mapr/kafka/kafka-0.9.0/bin/kafka-console-producer.sh --topic /my_volume/my_stream:my_topic --broker-list this.will.be.ignored:9092
      #Observe that the consumer cursor has not yet read all stream messages
      maprcli stream cursor list -path /my_volume/my_stream
      #Create a Snapshot
      maprcli volume snapshot create -cluster gcloud.cluster.com -snapshotname snapshot1 -volume my_volume
      #List the snapshot
      maprcli volume snapshot list -cluster gcloud.cluster.com -volume my_volume
      #Restore data from snapshot
      cd /mapr/gcloud.cluster.com/my_volume/
      cp .snapshot/snapshot1/my_file.json my_file.json2
      mapr copytable -src /my_volume/.snapshot/snapshot1/my_table -dst /my_volume/my_table2 -mapreduce false
      mapr copystream -src /my_volume/.snapshot/snapshot1/my_stream -dst /my_volume/my_stream2 -mapreduce false
      #Verify that the ACLs are unchanged
      ls -l
      stat my_file.json
      stat my_file.json2
      stat my_table
      stat my_table2
      stat my_stream
      stat my_stream2
      #Verify that the data are unchanged
      diff my_file.json my_file.json2
      rm -rf /mapr/gcloud.cluster.com/difftable_output /mapr/gcloud.cluster.com/diffstream_output
      mapr difftables -src /my_volume/my_table -dst /my_volume/my_table2 -outdir /difftable_output -mapreduce false
      mapr diffstreams -src /my_volume/my_stream -dst /my_volume/my_stream2 -outdir /diffstream_output -mapreduce false
      #Verify that stream cursor offsets are unchanged
      maprcli stream cursor list -path /my_volume/my_stream
      maprcli stream cursor list -path /my_volume/my_stream2
      #So stream consumers can still read from where they left off, like this:
      python ~/consumer.py /my_volume/my_stream:my_topic
      python ~/consumer.py /my_volume/my_stream2:my_topic
      maprcli volume remove -name my_volume -cluster gcloud.cluster.com -force true
  11. Open http://gcloudnodea:8047/
  12. ML orchestration that keeps track of all your experiments so you can always answer the question of how a model was trained, from data to parameters.
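
One way to connect the notes above to the deck's point that snapshot IDs can be checked into version control is a small experiment manifest. The following is a minimal sketch, not part of the original demo: the function names, the manifest field names, and the example parameter values are hypothetical. The `maprcli volume snapshot create` flags and the `.snapshot/<name>` path layout are taken from the slides. The code only builds the command and the manifest; it does not run `maprcli`.

```python
import json


def snapshot_command(cluster, volume, snapshot_name):
    """Build (but do not run) the snapshot command shown on the
    Creating Snapshots slide, as an argument list."""
    return [
        "maprcli", "volume", "snapshot", "create",
        "-cluster", cluster,
        "-snapshotname", snapshot_name,
        "-volume", volume,
    ]


def experiment_manifest(mount_path, snapshot_name, params):
    """Describe one experiment: which snapshot holds the training data
    and which model parameters were used. Field names are hypothetical."""
    return {
        "data": {
            "snapshot": snapshot_name,
            # Snapshot contents appear under <mount path>/.snapshot/<name>.
            "snapshot_path": f"{mount_path}/.snapshot/{snapshot_name}",
        },
        "params": params,
    }


command = snapshot_command("gcloud.mapr.com", "my_volume", "My_Experiment_01")
manifest = experiment_manifest(
    "/my_volume", "My_Experiment_01", {"layers": 50, "dropout": 0.3}
)

# Sorted keys give stable diffs when the manifest is committed with the code.
manifest_text = json.dumps(manifest, indent=2, sort_keys=True)
```

Committing `manifest_text` next to the model code keeps the data version, the parameters, and the source together, so a later reader can tell which snapshot a model was trained on.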