2. How familiar are you with Spark?
1. What is Apache Spark?
2. I have used Spark
3. I am using Spark in production or I contribute to
its development
2
3. How familiar are you with TensorFlow?
1. What is TensorFlow?
2. I have heard about it
3. I am training my own neural networks
3
4. Founded by the team who
created ApacheSpark
Offers a hosted service:
- Apache Spark in the cloud
- Notebooks
- Clustermanagement
- Productionenvironment
About Databricks
4
6. Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
6
7. Numerical computing for Data
Science
• Queries are data-heavy
• However algorithmsare computation-heavy
• They operate on simple data types: integers,floats,
doubles, vectors, matrices
7
8. The case for speed
• Numerical bottlenecksare good targets for
optimization
• Let data scientists get faster results
• Faster turnaroundfor experimentations
• How can we run these numerical algorithmsfaster?
8
9. Evolution of computing power
9
Failure is not an option:
it is a fact
When you can afford your dedicated chip
GPGPU
Scale out
Scale up
11. Evolution of computing power
• Processorspeedcannotkeep up with memory and
network improvements
• Accessto the processoris the new bottleneck
• ProjectTungstenin Spark: leverage the processor’s
heuristicsfor executingcode and fetching memory
• Does not accountfor the fact that the problem is
numerical
11
12. Asynchronous vs. synchronous
• Asynchronousalgorithms performupdates concurrently
• Spark is synchronousmodel, deep learningframeworks
usuallyasynchronous
• A large number of ML computations are synchronous
• Even deep learningmay benefit from synchronousupdates
12
13. Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
13
14. GPGPUs
14
• Graphics Processing Units for General Purpose
computations
6000
Theoretical peak
throughput
GPU CPU
Theoretical peak
bandwidth
GPU CPU
15. • Library for writing “machine intelligence”algorithms
• Very popular for deep learning and neural networks
• Can also be used for general purpose numerical
computations
• Interface in C++ and Python
15
Google TensorFlow
16. Numerical dataflow with
Tensorflow
16
x = tf.placeholder(tf.int32, name=“x”)
y = tf.placeholder(tf.int32, name=“y”)
output = tf.add(x, 3 * y, name=“z”)
session = tf.Session()
output_value = session.run(output,
{x: 3, y: 5})
x:
int32
y:
int32
mul 3
z
17. Numerical dataflow with Spark
df = sqlContext.createDataFrame(…)
x = tf.placeholder(tf.int32, name=“x”)
y = tf.placeholder(tf.int32, name=“y”)
output = tf.add(x, 3 * y, name=“z”)
output_df = tfs.map_rows(output, df)
output_df.collect()
df: DataFrame[x: int, y: int]
output_df:
DataFrame[x: int, y: int, z: int]
x:
int32
y:
int32
mul 3
z
18. Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
18
19. 19
It is a communication problem
Spark worker process Worker python process
C++
buffer
Python
pickle
Tungsten
binary
format
Python
pickle
Java
object
21. • Estimation of distribution
from samples
• Non-parametric
• Unknown bandwidth
parameter
• Can be evaluatedwith
goodness of fit
An example: kernel density scoring
21
22. • In practice, compute:
with:
• In a nutshell:a complexnumerical function
An example: kernel density scoring
22
23. 23
Speedup
0
60
120
180
Scala UDF Scala UDF (optimized) TensorFrames TensorFrames + GPU
Run time (sec)
def score(x: Double): Double = {
val dis = points.map { z_k =>
- (x - z_k) * (x - z_k) / ( 2 * b * b)
}
val minDis = dis.min
val exps = dis.map(d => math.exp(d - minDis))
minDis - math.log(b * N) + math.log(exps.sum)
}
val scoreUDF = sqlContext.udf.register("scoreUDF", score _)
sql("select sum(scoreUDF(sample)) from samples").collect()
31. The future
• Integrationwith Tungsten:
• Direct memory copy
• Columnar storage
• Betterintegration with MLlib data types
• GPU instances in Databricks:
Official support coming this summer
31
32. Recap
• Spark: an efficient framework for running
computations on thousands of computers
• TensorFlow: high-performance numerical
framework
• Get the best of both with TensorFrames:
• Simple API for distributed numerical computing
• Can leveragethe hardwareof the cluster
32
33. Try these demos yourself
• TensorFrames source code and documentation:
github.com/tjhunter/tensorframes
spark-packages.org/package/tjhunter/tensorframes
• Demo available lateron Databricks
• The official TensorFlow website:
www.tensorflow.org
• More questions and attending the Spark summit?
We will hold office hours at the Databricks booth.
33