All AI Roads Lead to Distribution
jim_dowling
dotAI Paris 2018
2/38
“Methods that scale with computation
are the future of AI”*
- Rich Sutton (Founding Father of Reinforcement Learning)
* https://www.youtube.com/watch?v=EeMCEQa85tw
Massive Increase in Compute for AI*
Distributed Systems
3.5-month doubling time
*https://blog.openai.com/ai-and-compute
3/38
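As a rough sanity check on this chart (the speaker notes below cite a ~300,000-fold increase in training compute since AlexNet), a 3.5-month doubling time compounds to roughly that factor in a little over five years. A minimal sketch in Python, assuming simple compounding:

import math

doublings = math.log2(300_000)   # ~18.2 doublings for a 300,000x increase
months = doublings * 3.5         # ~64 months at a 3.5-month doubling time
print(round(doublings, 1), round(months / 12, 1))   # ~18.2 doublings over ~5.3 years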
Model Parallelism vs. Data Parallelism
[Diagram: Model parallelism – one big model split across 3 GPUs (GPU1: layers 1-100, GPU2: layers 100-199, GPU3: layers 200-299), taking training/test data in and producing predictions. Data parallelism – a copy of the model on each of GPU0..GPU99, each reading 32 files from a distributed filesystem for a batch size of 3200; gradients are aggregated at a barrier to produce the new model.]
4/38
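To make the data-parallel half of the diagram above concrete, here is a minimal sketch of synchronous data parallelism with gradient averaging. It uses plain NumPy and a toy linear model rather than the GPU frameworks shown later in the deck, so the model, data, and learning rate are illustrative assumptions:

import numpy as np

num_workers = 4          # stand-in for GPU0..GPU99
lr = 0.1
w = np.zeros(3)          # one model, replicated on every worker

def shard_gradient(w, X, y):
    # Gradient of mean squared error for a linear model on one data shard.
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=400)

for step in range(100):
    shards = zip(np.array_split(X, num_workers), np.array_split(y, num_workers))
    grads = [shard_gradient(w, Xs, ys) for Xs, ys in shards]  # computed in parallel on real GPUs
    g = np.mean(grads, axis=0)    # "aggregate gradients" at the barrier
    w -= lr * g                   # every replica applies the identical update

print(w)    # converges to roughly [1.0, -2.0, 0.5]

Because every replica applies the same averaged gradient behind the barrier, all copies of the model stay identical – the same property the Horovod slide later in the deck relies on.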
Mini AI Arms Race: Facebook vs. Google – CNNs, ImageNet, Distribution
5/38
ImageNet
[Image from https://www.slideshare.net/xavigiro/image-classification-on-imagenet-d1l4-2017-upc-deep-learning-for-computer-vision ]
6/38
Improved Accuracy for ImageNet
Improvements in ImageNet – Accuracy
[Chart: ImageNet Top-1 accuracy (%), y-axis 81.5–84.5 – Google-1 (May '17, AutoML with RL), Google-2 (March '18, AutoML with GA), Facebook (May '18, more data and more compute: 1 Bn images).]
Facebook: https://goo.gl/ERpJyr
Google-1: https://goo.gl/EV7Xv1
Google-2: https://goo.gl/eidnyQ
8/38
Methods for Improving Model Accuracy – Reduced Generalization Error
-Hyperparameter optimization
-Larger training datasets
-Better regularization methods
-Better optimization algorithms
-Design better models
Distribution
9/38
Faster Training of ImageNet: from a single GPU to clustered GPUs
[Image from https://www.matroid.com/scaledml/2018/jeff.pdf]
Improvements in ImageNet – Training Time
[Chart: training time (mins) to >75% top-1 on ImageNet, y-axis 0–60 – Facebook (June '17, 256 Nvidia P100s), Google-1 (Dec '17, 256 TPUv2s), Google-2 (March '18, 256 TPUv2s).]
FB: https://goo.gl/ERpJyr
Google-1: https://goo.gl/EV7Xv1
Google-2: https://goo.gl/eidnyQ
11/38
ImageNet – Files/Sec Processed
[Chart: files processed (K/s), y-axis 0–160 – Facebook (June '17), Google-1 (Dec '17), Google-2 (March '18). HDFS: ~80K files/s.]
FB: https://goo.gl/ERpJyr
Google-1: https://goo.gl/EV7Xv1
Google-2: https://goo.gl/eidnyQ
12/38
Distributed Filesystem Read Throughput – Remove the Bottleneck and Keep Scaling
[Chart: files processed (K/s), y-axis 0–1200 – Facebook, Google-1, Google-2, HopsFS.]
HopsFS: https://goo.gl/yFCsGc – Scale Challenge Winner (2017)
(Plug for Hops)
13/38
All AI Roads Lead to Distribution
Distributed Deep Learning:
-Hyperparameter optimization
-Distributed training
-Larger training datasets
-Elastic model serving
-Parallel experiments
-(Commodity) GPU clusters
-AutoML
14/38
THEY WILL IGNORE YOU, DISTRIBUTION,
UNTIL THEY NEED YOU, DISTRIBUTION
Top 10 List: Why Big Data/Distribution sucks
1. I don’t know which is scarier: reading Stephen King’s horror novel IT or reading a file from HDFS in Python.
2. MapReduce – how can a programming model with only 2 parts have O(N²) complexity?
3. I managed to install Hadoop – at the cost of my last relationship… the Spark died!
4. “Jupyter’s down again” – some service you never heard of stopped it working…
5. I still don’t know which is worse – a dead service or a dead slow service.
16/38
Top 10 List: Why Big Data/Distribution sucks
6. How do I debug this thing on my laptop?
7. Enough with the Dockerfiles and YAML files, when can I start writing *real* code?
8. Isn’t it supposed to be faster when I add a new server?
9. Do I really have to install the library on *all* of the servers?
10. Distributed? I don’t even do multi-threaded!
17/38
GPU/TPU Services in the Cloud and On-Premise
•Managed Cloud Platforms
-RiseML
-FloydHub
-Google Cloud ML
-Microsoft Batch AI
-AWS SageMaker
•On-Premise/Cloud Platforms
-KubeFlow
-Hops
18/38
Hops Open-Source Data Science Platform
19/38
PySpark
End-to-End ML Pipeline in Python on Hops
20/38
[Pipeline stages: Data Collection, Data Transformation & Verification, Feature Extraction, Experimentation, Training, Test, Serving (TensorFlow / TF Serving)]
FLIP-O-RAMA with Python and HDFS/Hops
(Left Hand Here / Right Thumb Here)
[Pipeline stages: Data Collection, Data Transformation & Verification, Feature Extraction, Experimentation, Training, Test, Serving]
import hops.hdfs as hdfs
import pandas as pd

features = ["Age", "Occupation", "Sex", …, "Country"]

with open(hdfs.project_path() +
          "/TestJob/data/census/adult.data", "r") as trainFile:
    train_data = pd.read_csv(trainFile, names=features,
                             sep=r'\s*,\s*', engine='python', na_values="?")
Data Ingestion with Pandas
(Right Thumb Here)
import hops.hdfs as hdfs
import pandas as pd

features = ["Age", "Occupation", "Sex", …, "Country"]

h = hdfs.get_fs()
with h.open_file(hdfs.project_path() +
                 "/TestJob/data/census/adult.data", "r") as trainFile:
    train_data = pd.read_csv(trainFile, names=features,
                             sep=r'\s*,\s*', engine='python', na_values="?")
Data Ingestion with Pandas on Hops
(Right Thumb Here)
Google Facets
24/38
[Pipeline stages: Data Collection, Data Transformation & Verification, Feature Extraction, Experimentation, Training, Test, Serving]
train_data = pd.read_csv(…)
test_data = pd.read_csv(…)
facets.overview(train_data, test_data)
facets.dive(test_data.to_json(orient='records'))
Google Facets
25/38
Big Data Preparation with PySpark
images = spark.readImages(IMAGE_PATH, recursive = True,
                          numPartitions=10, sampleRatio = 0.1).cache()

tr = (ImageTransformer().setOutputCol("transformed")
      .resize(height = 200, width = 200)
      .crop(0, 0, height = 180, width = 180))

smallImages = tr.transform(images).select("transformed")
dfutil.saveAsTFRecords(smallImages, OUTPUT_DIR)
26/38
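Downstream, those TFRecords are typically consumed from TensorFlow. A minimal sketch with tf.data, assuming a recent TensorFlow and a single serialized "transformed" column – the actual schema written by dfutil.saveAsTFRecords and the "part-*" file pattern are assumptions, not taken from the slide:

import tensorflow as tf

# Hypothetical schema: one serialized image tensor per record.
feature_spec = {'transformed': tf.io.FixedLenFeature([], tf.string)}

def parse(record):
    return tf.io.parse_single_example(record, feature_spec)

files = tf.io.gfile.glob(OUTPUT_DIR + "/part-*")   # OUTPUT_DIR as on the slide
dataset = tf.data.TFRecordDataset(files).map(parse).batch(32)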
Experimentation and Training
[Pipeline stages: Data Collection, Data Transformation & Verification, Feature Extraction, Experimentation, Training, Test, Serving]
GridSearch with TensorFlow/Spark on Hops
from hops import experiment

def train(learning_rate, dropout):
    [TensorFlow Code here]

args_dict = {'learning_rate': [0.001, 0.005, 0.01],
             'dropout': [0.5, 0.6]}

experiment.launch(spark, train, args_dict)
Launch 6 Spark Executors
28/38
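For intuition, the grid above expands to 3 x 2 = 6 hyperparameter combinations, which is why 6 Spark executors are launched – one trial per executor. A minimal sketch of that expansion in plain Python; this illustrates only the cartesian product, not the hops experiment API itself:

from itertools import product

args_dict = {'learning_rate': [0.001, 0.005, 0.01],
             'dropout': [0.5, 0.6]}

keys = list(args_dict)
grid = [dict(zip(keys, values))
        for values in product(*(args_dict[k] for k in keys))]

print(len(grid))    # 6 combinations -> 6 parallel trials
for params in grid:
    print(params)   # e.g. {'learning_rate': 0.001, 'dropout': 0.5}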
Model Architecture Search on TensorFlow/Hops
from hops import experiment

def train_cifar10(learning_rate, dropout, num_layers):
    [TensorFlow Code here]

search_dict = {'learning_rate': [0.005, 0.00005],
               'dropout': [0.01, 0.99], 'num_layers': [1, 3]}

experiment.evolutionary_search(spark, train_cifar10, search_dict,
                               direction='max', popsize=10, generations=3,
                               crossover=0.7, mutation=0.5)
29/38
Launch 10 Spark Executors
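The evolutionary search behind the next few slides is differential evolution over the hyperparameter ranges above. A minimal single-generation sketch using only Python's random module – the objective function here is a stand-in for the metric returned by train_cifar10, and num_layers is treated as continuous for simplicity; in the real setup each candidate trains on its own Spark executor:

import random

bounds = {'learning_rate': (0.00005, 0.005), 'dropout': (0.01, 0.99), 'num_layers': (1, 3)}
popsize, F, CR = 10, 0.8, 0.7    # population size, differential weight, crossover rate

def random_candidate():
    return {k: random.uniform(lo, hi) for k, (lo, hi) in bounds.items()}

def evaluate(candidate):
    # Stand-in objective; the real search maximizes the accuracy returned by train_cifar10.
    return -abs(candidate['learning_rate'] - 0.001) - abs(candidate['dropout'] - 0.5)

population = [random_candidate() for _ in range(popsize)]
scores = [evaluate(c) for c in population]

for i, target in enumerate(population):
    a, b, c = random.sample([p for j, p in enumerate(population) if j != i], 3)
    trial = {}
    for k, (lo, hi) in bounds.items():
        mutant = a[k] + F * (b[k] - c[k])                      # differential mutation
        value = mutant if random.random() < CR else target[k]  # crossover
        trial[k] = min(max(value, lo), hi)                     # clip to the search range
    if evaluate(trial) > scores[i]:                            # greedy selection (direction='max')
        population[i], scores[i] = trial, evaluate(trial)

print(max(scores))   # best score after one generation; repeat for generations=3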
Differential Evolution in Tensorboard (1/4)
CIFAR-10
30/38
Differential Evolution in Tensorboard (2/4)
CIFAR-10
31/38
Differential Evolution in Tensorboard (3/4)
CIFAR-10
32/38
Differential Evolution in Tensorboard (4/4)
CIFAR-10
33/38
Distributed Training with Horovod
import horovod.tensorflow as hvd

hvd.init()
opt = hvd.DistributedOptimizer(opt)

if hvd.local_rank() == 0:
    [TensorFlow Code here]
    …..
else:
    [TensorFlow Code here]
    …..
34/38
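A slightly fuller sketch of the same pattern, assuming horovod.tensorflow.keras and TensorFlow 2.x (both newer than this 2018 deck) and a placeholder model and dataset – not the slide's actual training code:

import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each Horovod process to one GPU (one process per GPU).
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Scale the learning rate by the number of workers, then wrap the optimizer so
# gradients are averaged across workers with ring-allreduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)

# Broadcast initial weights so replicas start in sync; only rank 0 writes
# checkpoints, matching the rank-0 branch on the slide.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))

x, y = np.random.rand(1024, 32), np.random.randint(0, 10, size=1024)
model.fit(x, y, batch_size=64, epochs=2, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)

Launched with something like horovodrun -np 4 python train.py; in a real job each process would read its own shard of the data rather than the random arrays used here.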
Model Serving: Monitor + Retrain
35/38
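The serving slide itself is a screenshot; as a rough illustration of the serving step, here is a minimal sketch of a REST call to a TensorFlow Serving endpoint. The host, model name, and feature vector are made-up assumptions (8501 is TF Serving's default REST port), not details from the slide:

import json
import requests

url = "http://localhost:8501/v1/models/census:predict"
payload = {"instances": [[39, 7, 77516, 13]]}    # hypothetical feature vector

response = requests.post(url, data=json.dumps(payload))
print(response.json())    # e.g. {"predictions": [...]}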
Summary
•The future of Deep Learning is Distributed
 https://www.oreilly.com/ideas/distributed-tensorflow
•Hops is a new Data Platform with first-class support for Python / Deep Learning / ML / Data Governance / GPUs

“It is starting to look like deep learning workflows of the future feature autotuned architectures running with autotuned compute schedules across arbitrary backends.”*
- Andrej Karpathy, Head of AI @ Tesla
*https://twitter.com/karpathy/status/972701240017633281
36/38
The Team
Active: Jim Dowling, Seif Haridi, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Fabio Buso, Robin Andersson, August Bonds.
Alumni: Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid, ArunaKumari Yedurupaka, Tobias Johansson, Roberto Bampi, Roshan Sedar.
www.hops.io
@hopshadoop


Editor's Notes

  • #3 AI is going distributed. These words come from Rich Sutton, and that is significant because the pre-DeepMind reinforcement learning developed by him and others did not scale with available compute. It was single-agent, single-threaded AI. But that has changed in the last few years, as we can see on this chart.
  • #4 If we take AlexNet as the starting point for the deep learning revolution on hardware accelerators like GPUs, then in the last 5 years we have seen a 300,000-fold increase in the number of compute operations required to train state-of-the-art deep learning models, most recently Neural Machine Translation by Google and AlphaGo Zero by DeepMind. That is a 3.5-month doubling time. Compare that with Moore’s Law, which states that the number of transistors on a chip doubles every 18 months. When Moore’s Law hit the wall in the mid-2000s, it crossed that chasm by going multi-core. The increase in compute in the last 5 years has not happened because Nvidia doubled the number of cores on their GPUs every 3.5 months. Deep Learning also hit a wall, around 2016. We crossed that chasm by going distributed. And yes, AlphaGo would not have beaten the world’s best Go player if it had not been a distributed system.
  • #6 DL years are like dog years – so the last 5 years is half a lifetime. Let’s look at the last 12 months, anthropomorphized by Yann LeCun and Jeff Dean.
  • #9 1 billion images with 1,500 hashtags from Instagram, trained over several weeks on 336 GPUs.
  • #10 Better optimization algorithms – large-batch second-order methods like K-FAC. Better regularization methods – like Dropout. The other three – which were the domain of ML experts – can now be the domain of AI itself: throw compute and storage at them and they will do better than us.
  • #11 The first phase of the Battle begins with who can train DNNs faster
  • #12 Facebook initiated the hostilities….
  • #17 A friend tried to explain Paxos to me – it was painful for both of us.
  • #18 How can it be slower this time? I haven’t changed anything! What do you mean service X has moved? A WHAT caused my system to crash?!? What do you mean the time isn’t the same at all servers? Hey Spark, I just want to collect TBs at the Driver… what’s your problem with that?
  • #19 KubeFlow – Data Scientists write infrastructure (Dockerfiles, Kubernetes YML files) and ML/DL code
  • #20 Code over visual programming. Languages: Python/R. ML frameworks: scikit-learn, SparkML. Notebooks: Jupyter. DL frameworks: Keras-TensorFlow/PyTorch.
  • #23 We have this dataset – the census data from the US. The hdfs module we import will just give us a path to the project – not an HDFS path, just a “/Project/dotai” path.
  • #24 Flip-o-rama
  • #27 Higher-level APIs in TensorFlow: Estimator, Experiment, and Dataset (tensorflow.contrib.learn.python.learn.estimators; Tuner.run_experiment(..)). Spark can launch many TensorFlow jobs in parallel for hyperparameter tuning – one experiment per mapper.
  • #35 Higher-level APIs in TensorFlow: Estimator, Experiment, and Dataset (tensorflow.contrib.learn.python.learn.estimators; Tuner.run_experiment(..)). Spark can launch many TensorFlow jobs in parallel for hyperparameter tuning – one experiment per mapper.
  • #37 In other words, the future of deep learning is “distributed”.
  • #38 HopsFS is a drop-in replacement for HDFS that can scale the Hadoop Filesystem (HDFS) to over 1 million ops/s by migrating the NameNode metadata to an external scale-out in-memory database. This talk will introduce recent improvements in HopsFS: storing small files in the database (both in-memory and on SSD disk tables), a new scalable block-reporting protocol, support for erasure coding with data locality, and work on multi-data-center replication. For small files (under 64-128 KB), HopsFS can reduce read latency to under 10ms, while also improving read throughput by 3-4X and write throughput by >15X. Our new block-reporting protocol reduces block-reporting traffic by up to 99% for large clusters, at the cost of a small increase in metadata, while our solution for erasure coding is implemented at the block level, preserving data locality. Finally, our ongoing work on geographic replication points a way forward for HDFS in the cloud, providing data-center-level high availability without any performance hit. Like Maslow’s hierarchy of needs, the AI hierarchy of needs describes the stepwise requirements that organizations must go through to become data-driven and intelligent in their use of data. First, organizations must reliably ingest and manage their data assets with pipelines – Hadoop being the natural data platform for this first level. Only then can organizations progress to understanding their data in hindsight with analytics tools, scale-out databases, and visualization. From here, organizations need to transition to predictive analytics, labelling data, building traditional machine learning and deep learning models, and managing features and experiments. To reach the level of hyperscale AI companies, organizations need to scale out their neural network training to tens or hundreds of GPUs and automate their machine learning processes. In this talk, we will traverse this AI hierarchy of needs for Hadoop and TensorFlow. We will start with support for GPUs in YARN, and work our way up to distributing TensorFlow (using parameter server (TensorFlowOnSpark) and AllReduce (Horovod) frameworks) across many servers, with a unified feature store based on HDFS, Hive, and Kafka.