> Reproducibility with
unstructured data in 3 steps
Dmitry Petrov
DVC.org
|00|
Co-Founder & CEO > Iterative.AI > San Francisco, USA
ex-Data Scientist > Microsoft (BingAds) > Seattle, USA
ex-Head of Lab > St. Petersburg Electrotechnical University > Russia
|HELLO|
Dmitry Petrov
PhD in Computer Science
Twitter: @FullStackML
Creator of
DVC.org project
> Data Analyst → Structured
> Data Scientist → Semi-structured
> ML engineer → Unstructured
- NLP - text files
- Computer Vision - images
- Multiple files and/or formats
|Unstructured data|
Reproducibility is about storing
and sharing the mapping:
Code + Data → Model
|Unstructured data reproducibility|
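One way to make the mapping Code + Data → Model concrete is to derive a single identifier from the code revision and the data hashes, and to store the model location under it. The sketch below is only an illustration of that idea: `fingerprint`, `record`, the in-memory `registry`, and the revision string `a1b2c3d` are hypothetical names, not part of any tool shown in these slides.

```python
import hashlib
import json


def fingerprint(code_rev, data_hashes):
    """Return a stable ID for one (code revision, data snapshot) pair."""
    # sort_keys makes the ID independent of dictionary ordering.
    payload = json.dumps({"code": code_rev, "data": data_hashes}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical in-memory registry: fingerprint -> model location.
registry = {}


def record(code_rev, data_hashes, model_uri):
    """Store which model was produced by this code + data combination."""
    key = fingerprint(code_rev, data_hashes)
    registry[key] = model_uri
    return key


# "a1b2c3d" stands in for a Git commit; the hash and URI reuse values from the slides.
key = record(
    "a1b2c3d",
    {"data.tsv": "fadc70dff966edd21b3dd2b0c2755189"},
    "s3://mybucket/semsegm-proj/model-5-1.pkl",
)
print(registry[key])
```

Anyone who has the same code revision and the same data hashes can recompute the key and find the model, which is the part of the mapping that has to be shared.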
Use a central storage for all
data artifacts:
- Datafiles
- Models
|1st - CENTRAL STORAGE|
s3://mybucket/semsegm-proj/
|2nd - DECOUPLE DATA FROM CODE|
Use dataset metafiles. Do not read
data from code directly.
Individual file → Snapshot
# files-meta.yaml
file1: s3://mybucket/semsegm-proj/raw/file1-ver2020-10-01
file2: s3://mybucket/semsegm-proj/raw/file1-ver2020-07-26
...
model.pkl: s3://mybucket/semsegm-proj/model-5-1.pkl
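A minimal sketch of what "do not read data from code directly" could look like in practice: the training code asks the metafile for a logical name and gets back the pinned location. The helpers `parse_metafile` and `resolve` are hypothetical, and the parser handles only the flat `name: uri` lines shown above, not YAML in general.

```python
def parse_metafile(text):
    """Parse flat 'name: uri' lines; skip blanks, comments and '...' lines."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line == "...":
            continue
        name, sep, uri = line.partition(": ")
        if not sep:
            raise ValueError(f"not a 'name: uri' line: {line!r}")
        mapping[name.strip()] = uri.strip()
    return mapping


def resolve(mapping, name):
    """Return the pinned location for a logical file name."""
    if name not in mapping:
        raise KeyError(f"{name!r} is not listed in the metafile")
    return mapping[name]


META = """\
# files-meta.yaml
file1: s3://mybucket/semsegm-proj/raw/file1-ver2020-10-01
...
model.pkl: s3://mybucket/semsegm-proj/model-5-1.pkl
"""

snapshot = parse_metafile(META)
# The code refers to "file1"; the version it gets is decided by the metafile.
print(resolve(snapshot, "file1"))
```

Switching to another snapshot then means changing the metafile (and committing it), not editing paths in the code.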
Version metrics files in Git
$ cat metrics.json
{
  "AUC": 0.8367073,
  "TP": 8614,
  "Process": {
    "Threshold": 0.92
    …
$ git diff release-sept
…
- "AUC":0.7906391,
+ "AUC":0.8367073,
|3rd - BE METRICS DRIVEN|
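Because the metrics file is plain JSON under version control, two versions of it can also be compared programmatically. The sketch below assumes both versions have already been loaded (for example from two Git revisions) and that all leaf values are numeric; `flatten` and `metrics_diff` are hypothetical helper names, and only the two AUC values come from the slide.

```python
import json


def flatten(metrics, prefix=""):
    """Turn nested metrics into flat 'Process.Threshold'-style names."""
    flat = {}
    for key, value in metrics.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


def metrics_diff(old, new):
    """Return {metric: (new value, change)} for metrics whose value changed."""
    old_flat, new_flat = flatten(old), flatten(new)
    rows = {}
    for name in sorted(new_flat):
        if name in old_flat and new_flat[name] != old_flat[name]:
            # Assumes numeric leaves; non-numeric values would need special handling.
            rows[name] = (new_flat[name], new_flat[name] - old_flat[name])
    return rows


old = json.loads('{"AUC": 0.7906391}')  # value at release-sept, from the git diff
new = json.loads('{"AUC": 0.8367073}')  # current value
for name, (value, change) in metrics_diff(old, new).items():
    print(f"{name}  {value}  {change:+.7f}")
```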
II. DVC - Data Version Control
|1st - CENTRAL STORAGE|
DVC introduces data-remote
$ dvc remote add -d myremote s3://mybucket/semsegm-proj/
$ dvc push
|2nd - DECOUPLE DATA FROM CODE|
$ dvc add data.tsv
$ cat data.tsv.dvc
outs:
- md5: fadc70dff966edd21b3dd2b0c2755189
path: data.tsv
size: 593310482
$ dvc push data.tsv
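The `.dvc` file above identifies the data by a content hash and a size rather than by where the file happens to live. The sketch below shows the general idea of such a pointer using the standard library only; it is not DVC's implementation, the exact hashing rules DVC applies may differ, and `file_md5` and `describe` are hypothetical names.

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large datasets never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe(path):
    """Build a small pointer record with the same fields as the slide's example."""
    return {
        "md5": file_md5(path),
        "path": os.path.basename(path),
        "size": os.path.getsize(path),
    }
```

A record like this is small enough to commit to Git, while the file it describes stays in the central storage.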
$ dvc metrics diff release-sept
Path          Metric  Value      Change
metrics.json  AUC     0.8367073  0.0460682
metrics.json  TP      8291       374
$ dvc plots diff release-sept
|3rd - BE METRICS DRIVEN|
> Questions
Email dmitry@iterative.ai
Web http://dvc.org
> Actions
Follow @FullStackML
|THANK YOU|
