Experimentation is fundamental to how software is developed for Machine Learning (ML). The procedures used for data preparation, algorithm development, and hyper-parameter tuning are highly iterative and frequently depend on trial and error. To support this kind of development, you have to track the code, configurations, and data used for ML experiments so you can always answer the question of how a model was trained. However, large training datasets often preclude traditional version control software from being used for this purpose. In these cases, MapR Snapshots provide a highly attractive solution for data versioning.
In this presentation you will learn how to version data in files, tables, and streams with MapR Snapshots, and how to identify the cases where MapR Snapshots provide significant advantages over other data versioning techniques.
ML works like this: you make assumptions about the data, then you try a range of experiments.
Officially we call these “parameterized studies”. Unofficially we call this “trial and error”.
Trial and error leads to lots of versions. So version control is important. And for reproducibility, it’s critical to keep models, model config, and training data together in version control.
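As one concrete illustration (the file layout and field names here are hypothetical, not from the talk), an experiment manifest can pin all three together, referencing the snapshot created later in the demo:

import json

# Hypothetical manifest tying one training run to its exact inputs:
manifest = {
    "model_file": "models/model_v42.h5",                               # trained weights
    "model_config": {"hidden_nodes": 64, "dropout": 0.5, "lr": 0.001},
    "training_data": "/my_volume/.snapshot/snapshot1/my_file.json",    # immutable snapshot path
}
with open("experiment_v42.json", "w") as f:
    json.dump(manifest, f, indent=2)

Because the snapshot path is immutable, re-reading it later reproduces exactly the data the model saw.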
Also, it’s important for the iterative model development process to be as frictionless as possible, or productivity will suffer greatly.
There is no rule of thumb for the number of hidden nodes you should use.
It is something you have to figure out through trial and error.
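A minimal sketch of that trial-and-error loop, assuming Keras (the candidate sizes and input dimension are arbitrary placeholders):

from keras.models import Sequential
from keras.layers import Dense

# Try several hidden-layer widths; each one is a separate experiment.
for hidden_nodes in (16, 32, 64, 128):
    model = Sequential()
    model.add(Dense(hidden_nodes, activation='relu', input_dim=20))
    model.add(Dense(1, activation='sigmoid'))
    # Compile and fit each candidate (as in the next sketch), then
    # compare validation scores to pick a winner.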
Dropout forces better generalization by randomly disabling neurons during training, which keeps the network from over-relying on any single unit.
We must specify a loss function and an optimizer function when compiling the model.
The loss function is a way of penalizing the model for inaccurate predictions. We use binary cross entropy because we have just two classes (1 and 0).
The optimizer defines how to adjust neuron weights in response to inaccurate predictions. The Adam optimizer makes sense because I’ve read that Adam learns fast, is stable over a wide range of learning rates, and has comparatively low memory requirements. Keras uses a default learning rate of 0.001 for Adam.
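Putting that together, here is a minimal sketch of such a model in Keras; the layer sizes and input dimension are placeholders, not values from the talk:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))  # hidden layer; width found by trial and error
model.add(Dropout(0.5))                                # randomly drop units to force better generalization
model.add(Dense(1, activation='sigmoid'))              # one output unit for the two classes (1 and 0)

# Binary cross entropy penalizes inaccurate predictions on a two-class
# problem; Adam adjusts the weights, here at Keras's default rate of 0.001.
model.compile(loss='binary_crossentropy', optimizer=Adam(lr=0.001), metrics=['accuracy'])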
Here’s how file storage works: a file is stored as several blocks. A snapshot preserves pointers to those blocks rather than copying them, which is why taking a snapshot is fast and why unchanged data consumes no extra space.
########################################################################
# SNAPSHOT DEMO
# PRELIM:
# sudo mount -o hard,nolock localhost:/mapr /mapr
# ls /mapr
# RUN:
# doitlive play snapshot_demo.sh --commentecho
########################################################################
#Create a volume
maprcli volume create -name my_volume -mount 1 -path /my_volume
#Create a 1GB file
cp yelp_academic_dataset_business.json /mapr/gcloud.cluster.com/my_volume/my_file.json
#Create a MapR-DB JSON table
mapr importJSON -idField business_id -src /my_volume/my_file.json -dst /my_volume/my_table -mapreduce false
#Create a MapR Event Store stream for Apache Kafka
maprcli stream create -path /my_volume/my_stream -produceperm p -consumeperm p -topicperm p
#Write some data to the stream
printf "`seq 1 5`" | /opt/mapr/kafka/kafka-0.9.0/bin/kafka-console-producer.sh --topic /my_volume/my_stream:my_topic --broker-list this.will.be.ignored:9092
#Consume some data from the stream (a minimal sketch of consumer.py follows the demo)
python ~/consumer.py /my_volume/my_stream:my_topic
#Observe that the consumer cursor has read all stream messages
maprcli stream cursor list -path /my_volume/my_stream
#Write a couple more messages to the stream
printf "`seq 6 10`" | /opt/mapr/kafka/kafka-0.9.0/bin/kafka-console-producer.sh --topic /my_volume/my_stream:my_topic --broker-list this.will.be.ignored:9092
#Observe that the consumer cursor has not yet read all stream messages
maprcli stream cursor list -path /my_volume/my_stream
#Create a Snapshot
maprcli volume snapshot create -cluster gcloud.cluster.com -snapshotname snapshot1 -volume my_volume
#List the snapshot
maprcli volume snapshot list -cluster gcloud.cluster.com -volume my_volume
#Restore data from snapshot
cd /mapr/gcloud.cluster.com/my_volume/
cp .snapshot/snapshot1/my_file.json my_file.json2
mapr copytable -src /my_volume/.snapshot/snapshot1/my_table -dst /my_volume/my_table2 -mapreduce false
mapr copystream -src /my_volume/.snapshot/snapshot1/my_stream -dst /my_volume/my_stream2 -mapreduce false
#Verify that the ACLs are unchanged
ls -l
stat my_file.json
stat my_file.json2
stat my_table
stat my_table2
stat my_stream
stat my_stream2
#Verify that the data are unchanged
diff my_file.json my_file.json2
rm -rf /mapr/gcloud.cluster.com/difftable_output /mapr/gcloud.cluster.com/diffstream_output
mapr difftables -src /my_volume/my_table -dst /my_volume/my_table2 -outdir /difftable_output -mapreduce false
mapr diffstreams -src /my_volume/my_stream -dst /my_volume/my_stream2 -outdir /diffstream_output -mapreduce false
#Verify that stream cursor offsets are unchanged
maprcli stream cursor list -path /my_volume/my_stream
maprcli stream cursor list -path /my_volume/my_stream2
#So stream consumers can still read from where they left off, like this:
python ~/consumer.py /my_volume/my_stream:my_topic
python ~/consumer.py /my_volume/my_stream2:my_topic
maprcli volume remove -name my_volume -cluster gcloud.cluster.com -force true
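For reference, here is a minimal sketch of the ~/consumer.py helper invoked above. It assumes the mapr-streams-python client, whose API mirrors confluent-kafka-python; the group id and poll timeout are illustrative, not from the demo.

import sys
from mapr_streams_python import Consumer, KafkaError

# Read messages from the stream:topic given on the command line,
# e.g. /my_volume/my_stream:my_topic
consumer = Consumer({'group.id': 'demo_group',
                     'default.topic.config': {'auto.offset.reset': 'earliest'}})
consumer.subscribe([sys.argv[1]])
while True:
    msg = consumer.poll(timeout=5.0)
    if msg is None:
        break                      # no more messages; stop so the demo can continue
    if msg.error():
        if msg.error().code() == KafkaError._PARTITION_EOF:
            continue               # reached end of a partition; keep polling others
        raise SystemExit(msg.error())
    print(msg.value().decode('utf-8'))
consumer.close()

Because the consumer group's cursor is part of the stream, the snapshot captures it too, which is why the restored copy resumes from where the original left off.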
Open http://gcloudnodea:8047/
ML orchestration that keeps track of all your experiments so you can always answer the question of how a model was trained, from data to parameters.