Deep Learning on Apache Spark: TensorFrames & Deep Learning Pipelines
TensorFrames: Spark + TensorFlow. Since the creation of Apache Spark, I/O throughput has increased at a faster pace than processing speed. In a lot of big data applications, the bottleneck is increasingly the CPU. With the release of Apache Spark 2.0 and Project Tungsten, Spark runs a number of control operations close to the metal. At the same time, there has been a surge of interest in using GPUs (the Graphics Processing Units of video cards) for general purpose applications, and a number of frameworks have been proposed to do numerical computations on GPUs.

In this talk, we will discuss how to combine Apache Spark with TensorFlow, a new framework from Google that provides building blocks for Machine Learning computations on GPUs. Through a binding between Spark and TensorFlow called TensorFrames, distributed numerical transforms on Spark DataFrames and Datasets can be expressed in a high-level language and still rely on highly optimized implementations.

The developers of the TensorFrames package will provide an overview, a live demo on Databricks and a presentation of the future plans. For experts, this talk will also include some technical details on design decisions, the current implementation, and ongoing work on speed and performance optimizations for numerical applications.

1. Deep Learning with Spark: TensorFrames and Deep Learning Pipelines. Timothée Hunter, 9 November 2017
2. Bio • Timothée (Tim) Hunter • Machine Learning engineer at Databricks • X04, M.S. Stanford • Ph.D. from UC Berkeley in Machine Learning • Early Spark user (No. 2) • Contributions to MLlib • Co-creator of the TensorFrames, GraphFrames and Deep Learning Pipelines packages
3. About Databricks. The team: brings together the creators of Spark (now Apache Spark), started at UC Berkeley in 2009. The product: a unified platform for data analytics (Unified Analytics Platform). The mission: simplify Big Data (Making Big Data Simple). Try it for free: databricks.com
4. This talk • Spark and Deep Learning: the current state of affairs • TensorFrames: high-performance integration with TensorFlow • Deep Learning Pipelines (DLP): Deep Learning made simple
5. Deep Learning? • A family of techniques that represents transformations as successive layers • Applicable to classification, regression, etc. • Popular in the 1980s (neural networks) • In use again today (new learning algorithms, large data collections, better machines)
6. Spark and Deep Learning • Spark does not include a Deep Learning library • Existing libraries are often hard to use or slow • TensorFrames: low-level, high-performance integration • Deep Learning Pipelines: making Deep Learning easy with Spark
7. Trends: hardware. More machines (and regular failures) versus more powerful machines (GPGPU).
8. Trends: interfaces. NLTK, Torch, TensorFrames, Deep Learning Pipelines.
9. GPGPUs • Graphics Processing Units for General Purpose computations. (Charts: theoretical peak throughput in Tflops, single precision, and theoretical peak bandwidth in GB/s, GPU versus CPU.)
10. Google TensorFlow • A library dedicated to “machine intelligence” • Very popular for Deep Learning • But also excellent for numerical computing • Interfaces: C++ and Python (and other languages)
11. Numerical dataflow with TensorFlow:
    x = tf.placeholder(tf.int32, name="x")
    y = tf.placeholder(tf.int32, name="y")
    output = tf.add(x, 3 * y, name="z")
    session = tf.Session()
    output_value = session.run(output, {x: 3, y: 5})
(Slide diagram: the int32 placeholders x and y feed a mul node with the constant 3, then an add node producing z.)
12. Numerical dataflow with TensorFlow, over a Spark DataFrame:
    df = sqlContext.createDataFrame(…)
    x = tf.placeholder(tf.int32, name="x")
    y = tf.placeholder(tf.int32, name="y")
    output = tf.add(x, 3 * y, name="z")
    output_df = tfs.map_rows(output, df)
    output_df.collect()
Here df: DataFrame[x: int, y: int] and output_df: DataFrame[x: int, y: int, z: int]. (A self-contained version of this example follows below.)
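For reference, here is a minimal, runnable version of the slide's example. The DataFrame contents are illustrative assumptions (not from the talk); the sketch assumes the tensorframes package and TensorFlow 1.x graph mode, as used in the original demo.

    import tensorflow as tf
    import tensorframes as tfs

    # Illustrative data; the column names must match the placeholder names below.
    df = sqlContext.createDataFrame([(1, 10), (2, 20), (3, 30)], ["x", "y"])

    x = tf.placeholder(tf.int32, name="x")
    y = tf.placeholder(tf.int32, name="y")
    output = tf.add(x, 3 * y, name="z")

    # map_rows evaluates the graph once per row and appends the result as a new column "z".
    output_df = tfs.map_rows(output, df)
    output_df.show()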
13. It is a communication problem. (Slide diagram: a Spark worker process and a separate Python worker process; every exchange of data is converted between the Tungsten binary format, Java objects, Python pickles and a C++ buffer.)
14. TensorFrames: native embedding of TensorFlow. (Slide diagram: everything stays inside the Spark worker process; data moves from the Tungsten binary format and Java objects directly into a C++ buffer, with no Python pickling.)
15. Example: kernel density scoring • A non-parametric estimator • Commonly used to smooth surfaces
16. Example: kernel density scoring • In practice: compute the score with the kernel formula (the equations appear only as images on the slide; a reconstruction from the code is given below) • To keep things simple: a nice numerical function
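Reconstructing the formula from the scoring code on the next four slides (an inference from the code, not a copy of the slide image), the score of a point x given N sample points z_k and bandwidth b is

    \mathrm{score}(x) \;=\; \log\!\left( \frac{1}{N\,b} \sum_{k=1}^{N} \exp\!\left( -\frac{(x - z_k)^2}{2 b^2} \right) \right)

that is, the log of a Gaussian-kernel density estimate (up to the constant 1/sqrt(2*pi)). The implementations below evaluate it with a log-sum-exp style shift of the exponents so that the sum does not underflow.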
17. Speedup. (Chart: normalized cost for Scala UDF, Scala UDF (optimized), TensorFrames, and TensorFrames + GPU.) The baseline Scala UDF:
    def score(x: Double): Double = {
      val dis = points.map { z_k =>
        - (x - z_k) * (x - z_k) / (2 * b * b)
      }
      val minDis = dis.min
      val exps = dis.map(d => math.exp(d - minDis))
      minDis - math.log(b * N) + math.log(exps.sum)
    }
    val scoreUDF = sqlContext.udf.register("scoreUDF", score _)
    sql("select sum(scoreUDF(sample)) from samples").collect()
18. Speedup (same chart). The hand-optimized Scala UDF:
    def score(x: Double): Double = {
      val dis = new Array[Double](N)
      var idx = 0
      while (idx < N) {
        val z_k = points(idx)
        dis(idx) = - (x - z_k) * (x - z_k) / (2 * b * b)
        idx += 1
      }
      val minDis = dis.min
      var expSum = 0.0
      idx = 0
      while (idx < N) {
        expSum += math.exp(dis(idx) - minDis)
        idx += 1
      }
      minDis - math.log(b * N) + math.log(expSum)
    }
    val scoreUDF = sqlContext.udf.register("scoreUDF", score _)
    sql("select sum(scoreUDF(sample)) from samples").collect()
19. Speedup (same chart). The TensorFrames version:
    def cost_fun(block, bandwidth):
        distances = - square(constant(X) - sample) / (2 * b * b)
        m = reduce_max(distances, 0)
        x = log(reduce_sum(exp(distances - m), 0))
        return identity(x + m - log(b * N), name="score")

    sample = tfs.block(df, "sample")
    score = cost_fun(sample, bandwidth=0.5)
    df.agg(sum(tfs.map_blocks(score, df))).collect()
20. Speedup (same chart). The TensorFrames version on GPU:
    def cost_fun(block, bandwidth):
        distances = - square(constant(X) - sample) / (2 * b * b)
        m = reduce_max(distances, 0)
        x = log(reduce_sum(exp(distances - m), 0))
        return identity(x + m - log(b * N), name="score")

    with device("/gpu"):
        sample = tfs.block(df, "sample")
        score = cost_fun(sample, bandwidth=0.5)

    df.agg(sum(tfs.map_blocks(score, df))).collect()
21. Recap • Spark: excellent at distributing computations over hundreds of machines • TensorFlow: excellent at numerical computing • TensorFrames gives the best of both worlds
22. Deep Learning Pipelines: Deep Learning made simple
23. Deep Learning Pipelines • Open source (Apache 2.0), developed by Databricks • Focuses on ease of use and integration • without forgetting performance • Default language: Python • Apache Spark is used to distribute the work • Integrates with MLlib (Pipelines) to simplify ML workflows
24. Let's buy shoes online…
25. Challenges: machine resources, lots of manual work, labeled data
26. Deep Learning Pipelines: automatic distribution with Apache Spark, MLlib Pipelines API, transfer learning, model reuse. (A sketch of such a pipeline follows below.)
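The kind of pipeline this enables looks roughly like the following sketch, based on the sparkdl (Deep Learning Pipelines) API of that period rather than on code from the talk. The image paths and label assignment are hypothetical, and image loading assumes Spark 2.3+'s ImageSchema.

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.image import ImageSchema
    from pyspark.sql.functions import lit
    from sparkdl import DeepImageFeaturizer

    # Hypothetical layout: one directory of shoe images, one directory of everything else.
    shoes = ImageSchema.readImages("/data/img/shoes").withColumn("label", lit(1.0))
    other = ImageSchema.readImages("/data/img/other").withColumn("label", lit(0.0))
    train_df = shoes.union(other)

    # Transfer learning: a pre-trained InceptionV3 network turns each image into a
    # feature vector, and a plain logistic regression is trained on top of it.
    featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                     modelName="InceptionV3")
    lr = LogisticRegression(maxIter=20, labelCol="label", featuresCol="features")
    model = Pipeline(stages=[featurizer, lr]).fit(train_df)

    # The result is a regular MLlib PipelineModel that can score new images.
    predictions = model.transform(ImageSchema.readImages("/data/img/new"))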
27. Notebook!
28. Supervised learning (“Panda”). Pre-trained networks: LeNet, AlexNet, VGG, GoogLeNet, Inception, ResNet50, InceptionV3, ResNet101, InceptionV4, …
29. Supervised learning (“Panda”)
30. Embedding
31. Embedding
32. Embedding
33. Embedding
34. Notebook!
35. Scalability: Apache Spark. DeepImageFeaturizer.transform(image_data): 6 hours on a single node, 10 minutes distributed with Spark. (A sketch of this call follows below.)
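The featurization step on its own is a single DataFrame transform. A minimal sketch, using the same hypothetical image loading as the pipeline above:

    from pyspark.ml.image import ImageSchema
    from sparkdl import DeepImageFeaturizer

    image_data = ImageSchema.readImages("/data/img/all")   # hypothetical path
    featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                     modelName="InceptionV3")
    features_df = featurizer.transform(image_data)          # runs distributed on the cluster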
36. Deep Learning Pipelines: 10 minutes, 7 lines of code, 0 labels. Elastic distribution with Apache Spark, MLlib Pipelines API, transfer learning.
37. Thank you. Questions?