Environment for training models

Dmitry Spodarets
AI Rush

Who am I
Dmitry Spodarets
• Founder and CEO at FlyElephant
• PhD candidate at Odessa National University
• Lecturer at Odessa Polytechnic University
• Organizer of technical conferences about AI, BigData, HPC, JS, Web Technologies …

Agenda
• Data Science Tools Survey Results
• Computing resources
• Clouds (AWS & Azure)
• Containers (Docker, Singularity)
• FlyElephant platform for Data Science

Data Science Tools Survey
220 data scientists

Datasets
[Bar chart of dataset sizes, axis 0 to 70. Bins: less than 1 MB, 1.1 to 10 MB, 11 to 100 MB, 101 MB to 1 GB, 1.1 to 10 GB, 11 to 100 GB, 101 GB to 1 TB, 1.1 to 10 TB, 11 to 100 TB, 101 TB to 1 PB, 1.1 to 10 PB, 11 to 100 PB, over 100 PB]

Tools for collecting data
Python 45
R 26
Spark 18
SQL 15
Excel 13
Kafka 11
Pandas 10
custom 8
Hadoop 5
Numpy 5
SAS 5

Tools for storing data
PostgreSQL 37
CSV 31
MySQL 21
Hadoop 16
Excel 15
HDFS 15
MongoDB 15
My Server 12
Oracle 11
Hive 8

Programming languages
Python 151
R 88
SQL 37
Java 32
Scala 22
bash 17
C++ 17
JavaScript 15
C# 13
VBA 8
C 6
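
To compare these counts with the 220 respondents, they can be turned into shares. A minimal Python sketch, assuming each count is the number of respondents who named the language and that one respondent could name several (the function and variable names are illustrative, not part of the survey):

```python
# Counts copied from the "Programming languages" slide.
RESPONDENTS = 220  # survey size stated on the survey slide
language_mentions = {
    "Python": 151, "R": 88, "SQL": 37, "Java": 32, "Scala": 22, "bash": 17,
    "C++": 17, "JavaScript": 15, "C#": 13, "vba": 8, "C": 6,
}


def share_of_respondents(mentions, total=RESPONDENTS):
    """Return {name: percent of respondents}, rounded to one decimal place."""
    return {name: round(100 * count / total, 1) for name, count in mentions.items()}


shares = share_of_respondents(language_mentions)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The shares do not add up to 100% because the counts overlap.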

Libraries
Pandas 88
Numpy 68
scikit-learn 48
scipy 26
dplyr 20
matplotlib 20
ggplot2 15
Keras 14
Spark 13
XGBoost 13
TensorFlow 12

Tools for the visualization of data
matplotlib 66
ggplot 40
seaborn 33
Excel 22
Tableau 22
R 19
plotly 13
bokeh 12
d3 11

Clouds
AWS 77
none 41
Azure 25
Google 24
DigitalOcean 9
OpenStack 7
Watson 1

Computing resources
NVIDIA DGX-1 Deep Learning Supercomputer
170/3 TFLOPS (GPU FP16 / CPU FP32)
NVIDIA Tesla P100
~5 TeraFLOPS
~3 TeraFLOPS

Image Training Performance on GoogLeNet (images trained per second)
GPUs     Tesla K80   Tesla P100   P100 speedup
1        251.77      467.73       1.86X
2        425.38      791.96       1.87X
4        569.1       1230.63      2.2X
Source: http://www.nvidia.com/object/caffe-benchmarks.html
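
The speedup labels on this slide can be re-derived from the throughput figures. A minimal Python sketch, assuming the first three chart values belong to the Tesla K80 and the last three to the Tesla P100 (the names below are illustrative):

```python
# Images trained per second on GoogLeNet, keyed by number of GPUs.
# Assumption: the K80 / P100 assignment follows the order of the chart values.
k80 = {1: 251.77, 2: 425.38, 4: 569.1}
p100 = {1: 467.73, 2: 791.96, 4: 1230.63}


def generation_speedup(old, new):
    """Throughput ratio new/old for every GPU count present in both series."""
    return {gpus: new[gpus] / old[gpus] for gpus in old if gpus in new}


speedup = generation_speedup(k80, p100)
for gpus, ratio in sorted(speedup.items()):
    print(f"{gpus} GPU(s): P100 is {ratio:.2f}x faster than K80")
```

The computed ratios land close to the rounded labels printed on the chart.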

1080 vs Titan X vs K80 vs P100 (TFLOPS)
GPU       FP32 (Single precision)   FP64 (Double precision)
1080      8.8                       0.25
Titan X   10.1                      0.3
K80       8.7                       2.9
P100      10.6                      5.3
Source: http://www.nvidia.com/

Problem
Effective parallelization of algorithms
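
One way to put a number on this problem is scaling efficiency: the throughput gained from n GPUs divided by the ideal n-fold gain. A minimal Python sketch using the GoogLeNet throughput figures from the NVIDIA benchmark slide, assuming the first three chart values are the Tesla K80 and the last three the Tesla P100 (the function name is illustrative):

```python
# Images trained per second on GoogLeNet, keyed by number of GPUs.
throughput = {
    "Tesla K80": {1: 251.77, 2: 425.38, 4: 569.1},
    "Tesla P100": {1: 467.73, 2: 791.96, 4: 1230.63},
}


def scaling_efficiency(series):
    """Return {gpus: efficiency}, where 1.0 means perfect linear scaling."""
    baseline = series[1]  # single-GPU throughput
    return {gpus: (rate / baseline) / gpus for gpus, rate in series.items()}


efficiency = {card: scaling_efficiency(s) for card, s in throughput.items()}
for card, per_gpu in efficiency.items():
    for gpus, value in sorted(per_gpu.items()):
        print(f"{card}, {gpus} GPU(s): {value:.0%} of linear scaling")
```

Under these assumptions neither card reaches linear scaling at 4 GPUs, which is the gap that effective parallelization has to close.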

Computing power (Intel)
• Intel Math Kernel Library (Intel MKL)
Natively supports C, C++ and Fortran development.
Cross-language compatible with Java, C#, Python and other languages.
• Intel Data Analytics Acceleration Library (Intel DAAL)
Includes Python, C++ and Java APIs and connectors to popular data sources, including Spark and Hadoop.
• Intel MPI Library
Natively supports C, C++ and Fortran development.

Clouds
AWS                               Azure
P2-series: 16X K80                N-series: 4X K80
X1-series: 128 vCPU / 1952 GB     H-Series: 16 vCPU / 224 GB
C4-series: 36 vCPU / 60 GB
aws.amazon.com/marketplace/       azuremarketplace.microsoft.com

Azure CLI
1. sudo pip install azure-cli
2. az login
3. az group create --name GroupName --location EastUS
4. az vm create --resource-group GroupName --name MyVM --image Canonical:UbuntuServer:16.04-LTS:latest --size Standard_NC6 --storage-sku Standard_LRS --admin-username user --ssh-key-value ~/.ssh/id_rsa.pub
5. az vm deallocate --resource-group GroupName --name MyVM
6. az vm start --resource-group GroupName --name MyVM
7. az vm list-ip-addresses --resource-group GroupName --name MyVM
8. az vm delete --resource-group GroupName --name MyVM
9. az group delete --name GroupName

Data Science images in Azure Marketplace

Data Science images in AWS Marketplace

Docker (Dockerfile)
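FROM gcr.io/tensorflow/tensorflow
LABEL maintainer="Dmitry Spodarets <d.spodarets@flyelephant.net>"
# Update the base system and add basic tools
RUN apt-get update && apt-get -y upgrade && apt-get -y install git curl wget
# Start Jupyter with the launcher script expected in the base image
CMD ["/run_jupyter.sh"]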

Docker (build.sh)
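#!/bin/bash
# Build the image from ./<name>, push it to the registry,
# then remove the local copies.
function docker_build {
    docker build -t "$1" "./$1"
    docker tag "$1" "registry.flyelephant.net/$1"
    docker push "registry.flyelephant.net/$1"
    docker rmi "$1" "registry.flyelephant.net/$1"
}

case "$1" in
    all)
        # Build every image named in build.list
        for i in $(cat build.list); do
            docker_build "$i"
        done
        ;;
    *)
        # Build the single image given as the first argument
        docker_build "$1"
        ;;
esac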

Docker Hub
https://hub.docker.com/

Docker
1. docker images
2. docker run --memory 512m --cpus="2" --name mycont registry.flyelephant.net/tensorflow
3. docker exec -i -t mycont bash
4. docker ps
5. docker stats
6. docker stop CONTAINER_ID
7. docker start CONTAINER_ID
8. docker rm CONTAINER_ID

Docker Machine
• Amazon Web Services
• Digital Ocean
• Exoscale
• Generic
• Google Compute Engine
• IBM Softlayer
• Microsoft Azure
• Microsoft Hyper-V
• OpenStack
• Oracle
• VirtualBox
• Rackspace
• VMware Fusion
• VMware vCloud Air
• VMware vSphere
docker-machine create --driver azure --azure-subscription-id subscription-id --azure-resource-group resourcename --azure-ssh-user user --azure-size vm-size machine-name
docker-machine ssh machine-name

Singularity - Containers for Science
• First public release in April 2016, followed by a massive uptake
• HPC Wire Editor’s choice: Top Technologies to Watch for 2017
• Simple integration with resource managers, InfiniBand, GPUs, MPI, file systems, and supports multiple architectures (x86_64, PPC, ARM, etc.)
• Limits user’s privileges (inside user == outside user)
• No root owned container daemon
• Network images are supported via URIs and all require local caching:
○ docker:// - This will pull a container from Docker Hub
○ http://, https:// - This will pull an image or tarball from the URL, cache and run it
○ shub:// - Pull an image from the Singularity Hub

Singularity - Usage Examples
$ python ./hello.py
Hello World: The Python version is 2.7.5
$ sudo singularity exec --writable /tmp/debian.img apt-get install python
…
$ singularity exec /tmp/debian.img python ./hello.py
Hello World: The Python version is 2.7.12
Webinar "Introduction to Singularity"
https://youtu.be/h5rDnCA3NJA

Network Based Computing Lab
Ohio State University
• High-Performance Big Data (HiBD)
http://hibd.cse.ohio-state.edu/
• High-Performance Deep Learning (HiDL)
http://hidl.cse.ohio-state.edu/

FlyElephant platform for Data Science
We automate Data Science and help teams to work efficiently.
• Computing resources
• Ready-computing infrastructure
• Collaboration & Sharing
• Fast Deployment
• Expert Community

Ready-computing infrastructure
• Jupyter or other IDE
• Automatic running of tasks
• Server or Cluster

Our resources
• Public Clouds: Azure & AWS.
• Private cloud based on OpenStack.
• HPC-clusters based on SLURM.
• Docker-clusters based on Swarm / Singularity.
• Tools and languages: R, Python, Java, Scala, C/C++, Julia, OpenFOAM, Octave, PyFR, Scilab, GROMACS, MATLAB, Intel MKL, FlowVision, ANSYS, COMSOL, AVL, Hadoop, Spark, H2O, Anaconda, scikit-learn, TensorFlow, Theano, Caffe, etc.
FlyElephant US 1 Cloud (P100, K80, Titan X, FPGA (Xilinx))
• HPC HUB 1: 80 nodes (2 × Xeon E5-2680v2 (20 cores), 64GB RAM, IB FDR) and 240TB storage.
• HPC HUB 2: 100 nodes (2 × Xeon E5-2670v2 (20 cores), 256GB RAM, IB FDR) and 240TB storage.
• HPC HUB 3: 150 nodes (2 × Xeon E5-2650v2 (16 cores), 128GB RAM, 2 × Tesla K80, IB FDR) and 240TB storage.
Advania, CESGA, TACC(17), HLRS (14), LANL(10)

Dmitry Spodarets
d.spodarets@flyelephant.net
www.flyelephant.net
