Experimentation is fundamental to how software is developed for Machine Learning (ML). The procedures used for data preparation, algorithm development, and hyper-parameter tuning are highly iterative and frequently depend on trial and error. To support this kind of development, you have to track the code, configurations, and data used for ML experiments so you can always answer the question of how a model was trained. However, large training datasets often preclude traditional version control software from being used for this purpose. In these cases, MapR Snapshots provide a highly attractive solution for data versioning.
In this presentation you will learn how to version data in files, tables, and streams with MapR Snapshots, and how to identify the cases where MapR Snapshots provide significant advantages over other data versioning techniques.
ML works like this: you make assumptions about the data, then you try a range of experiments.
Officially we call these “parameterized studies”. Unofficially we call this “trial and error”.
Trial and error leads to lots of versions. So version control is important. And for reproducibility, it’s critical to keep models, model config, and training data together in version control.
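As one concrete illustration (the file layout and field names here are hypothetical, not from the talk), an experiment manifest can pin all three together, referencing the snapshot created later in the demo:

import json

# Hypothetical manifest tying one training run to its exact inputs:
manifest = {
    "model_file": "models/model_v42.h5",                               # trained weights
    "model_config": {"hidden_nodes": 64, "dropout": 0.5, "lr": 0.001},
    "training_data": "/my_volume/.snapshot/snapshot1/my_file.json",    # immutable snapshot path
}
with open("experiment_v42.json", "w") as f:
    json.dump(manifest, f, indent=2)

Because the snapshot path is immutable, re-reading it later reproduces exactly the data the model saw.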
Also, it’s important for the iterative model development process to be as frictionless as possible, or productivity will suffer greatly.
There is no rule of thumb for the number of hidden nodes you should use.
It is something you have to figure out through trial and error.
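A minimal sketch of that trial-and-error loop, assuming Keras (the candidate sizes and input dimension are arbitrary placeholders):

from keras.models import Sequential
from keras.layers import Dense

# Try several hidden-layer widths; each one is a separate experiment.
for hidden_nodes in (16, 32, 64, 128):
    model = Sequential()
    model.add(Dense(hidden_nodes, activation='relu', input_dim=20))
    model.add(Dense(1, activation='sigmoid'))
    # Compile and fit each candidate (as in the next sketch), then
    # compare validation scores to pick a winner.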
Dropout forces better generalization by randomly disabling neurons during training, which keeps the network from over-relying on any single unit.
We must specify a loss function and an optimizer function when compiling the model.
The loss function is a way of penalizing the model for inaccurate predictions. We use binary cross entropy because we have just two classes (1 and 0).
The optimizer defines how to adjust neuron weights in response to inaccurate predictions. The Adam optimizer makes sense because I’ve read that Adam learns fast, is stable over a wide range of learning rates, and has comparatively low memory requirements. Keras uses a default learning rate of 0.001 for Adam.
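Putting that together, here is a minimal sketch of such a model in Keras; the layer sizes and input dimension are placeholders, not values from the talk:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))  # hidden layer; width found by trial and error
model.add(Dropout(0.5))                                # randomly drop units to force better generalization
model.add(Dense(1, activation='sigmoid'))              # one output unit for the two classes (1 and 0)

# Binary cross entropy penalizes inaccurate predictions on a two-class
# problem; Adam adjusts the weights, here at Keras's default rate of 0.001.
model.compile(loss='binary_crossentropy', optimizer=Adam(lr=0.001), metrics=['accuracy'])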
Here’s how file storage works: a file is stored as several blocks. A snapshot preserves pointers to those blocks rather than copying them, which is why taking a snapshot is fast and why unchanged data consumes no extra space.
########################################################################
# SNAPSHOT DEMO
# PRELIM:
# sudo mount -o hard,nolock localhost:/mapr /mapr
# ls /mapr
# RUN:
# doitlive play snapshot_demo.sh --commentecho
########################################################################
#Create a volume
maprcli volume create -name my_volume -mount 1 -path /my_volume
#Create a 1GB file
cp yelp_academic_dataset_business.json /mapr/gcloud.cluster.com/my_volume/my_file.json
#Create a MapR-DB JSON table
mapr importJSON -idField business_id -src /my_volume/my_file.json -dst /my_volume/my_table -mapreduce false
#Create a MapR Event Store stream for Apache Kafka
maprcli stream create -path /my_volume/my_stream -produceperm p -consumeperm p -topicperm p
#Write some data to the stream
printf "`seq 1 5`" | /opt/mapr/kafka/kafka-0.9.0/bin/kafka-console-producer.sh --topic /my_volume/my_stream:my_topic --broker-list this.will.be.ignored:9092
#Consume some data from the stream (a minimal sketch of consumer.py follows the demo)
python ~/consumer.py /my_volume/my_stream:my_topic
#Observe that the consumer cursor has read all stream messages
maprcli stream cursor list -path /my_volume/my_stream
#Write a couple more messages to the stream
printf "`seq 6 10`" | /opt/mapr/kafka/kafka-0.9.0/bin/kafka-console-producer.sh --topic /my_volume/my_stream:my_topic --broker-list this.will.be.ignored:9092
#Observe that the consumer cursor has not yet read all stream messages
maprcli stream cursor list -path /my_volume/my_stream
#Create a Snapshot
maprcli volume snapshot create -cluster gcloud.cluster.com -snapshotname snapshot1 -volume my_volume
#List the snapshot
maprcli volume snapshot list -cluster gcloud.cluster.com -volume my_volume
#Restore data from snapshot
cd /mapr/gcloud.cluster.com/my_volume/
cp .snapshot/snapshot1/my_file.json my_file.json2
mapr copytable -src /my_volume/.snapshot/snapshot1/my_table -dst /my_volume/my_table2 -mapreduce false
mapr copystream -src /my_volume/.snapshot/snapshot1/my_stream -dst /my_volume/my_stream2 -mapreduce false
#Verify that the ACLs are unchanged
ls -l
stat my_file.json
stat my_file.json2
stat my_table
stat my_table2
stat my_stream
stat my_stream2
#Verify that the data are unchanged
diff my_file.json my_file.json2
rm -rf /mapr/gcloud.cluster.com/difftable_output /mapr/gcloud.cluster.com/diffstream_output
mapr difftables -src /my_volume/my_table -dst /my_volume/my_table2 -outdir /difftable_output -mapreduce false
mapr diffstreams -src /my_volume/my_stream -dst /my_volume/my_stream2 -outdir /diffstream_output -mapreduce false
#Verify that stream cursor offsets are unchanged
maprcli stream cursor list -path /my_volume/my_stream
maprcli stream cursor list -path /my_volume/my_stream2
#So stream consumers can still read from where they left off, like this:
python ~/consumer.py /my_volume/my_stream:my_topic
python ~/consumer.py /my_volume/my_stream2:my_topic
maprcli volume remove -name my_volume -cluster gcloud.cluster.com -force true
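For reference, here is a minimal sketch of the ~/consumer.py helper invoked above. It assumes the mapr-streams-python client, whose API mirrors confluent-kafka-python; the group id and poll timeout are illustrative, not from the demo.

import sys
from mapr_streams_python import Consumer, KafkaError

# Read messages from the stream:topic given on the command line,
# e.g. /my_volume/my_stream:my_topic
consumer = Consumer({'group.id': 'demo_group',
                     'default.topic.config': {'auto.offset.reset': 'earliest'}})
consumer.subscribe([sys.argv[1]])
while True:
    msg = consumer.poll(timeout=5.0)
    if msg is None:
        break                      # no more messages; stop so the demo can continue
    if msg.error():
        if msg.error().code() == KafkaError._PARTITION_EOF:
            continue               # reached end of a partition; keep polling others
        raise SystemExit(msg.error())
    print(msg.value().decode('utf-8'))
consumer.close()

Because the consumer group's cursor is part of the stream, the snapshot captures it too, which is why the restored copy resumes from where the original left off.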
Open http://gcloudnodea:8047/
ML orchestration that keeps track of all your experiments so you can always answer the question of how a model was trained, from data to parameters.