Deep Learning
with DL4J on
Apache Spark:
Yeah it's Cool, but
are You Doing it the
Right Way?
Hello!
I am Guglielmo Iozzia
Associate Director – Business Tech Analysis at
Previously at
2
MSD in Ireland + 50 years
Approx. 2,000 employees
Five sites: Ballydine, Brinny, Carlow and
Dublin
$2.5 billion investment to date
Approx. 50% of MSD’s top 20 products
manufactured here
Export to + 60 countries
€6.1 billion turnover in 2017
2017 + 300 jobs & €280m investment
MSD Biotech, Dublin, coming in 2021
Deep Learning
It is a subset of machine learning where artificial neural networks, algorithms
inspired by the human brain, learn from large amounts of data.
4
Some Practical Applications of Deep Learning
× Computer vision
× Text generation
× NLP and NLU
× Autonomous cars
× Robotics
× Gaming
× Quantitative finance
5
DL4J
It is an Open Source, distributed, Deep Learning
framework written for JVM languages.
6
It is integrated with Hadoop and Apache Spark and
can be used on distributed GPUs and CPUs.
7
DL4J Modules
× DataVec
× Arbiter
× NN
× Datasets
× RL4J
× DL4J-Spark
× ND4J
8
DL4J Code Example
9
[Code screenshot: Network Configuration; Training and Evaluation]
ND4J
It is an Open Source linear algebra and matrix manipulation library which supports n-dimensional arrays and it
is integrated with Apache Hadoop and Spark.
10
Apache Spark
It is a unified analytics engine for large-scale data
processing.
11
Apache Spark
Speed
Apache Spark achieves high performance for both batch and streaming data, using a
state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
Ease of Use
Write applications quickly in Java, Scala, Python, R and SQL.
12
Apache Spark
Generality
Spark provides a stack of libraries that can be combined seamlessly in the same
application.
Runs Everywhere
Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone or in the cloud. It can
access diverse data sources.
13
Why Distributed MNN Training with
DL4J and Apache Spark?
Why is this a powerful combination?
14
DL4J + Apache Spark
× DL4J provides a high-level API to design, configure, train and
evaluate MNNs.
× Spark's performance is excellent, in particular for
ETL/streaming, but in terms of computation, in an MNN
training context, some data transformation/aggregation
needs to be done using a low-level language.
× DL4J uses ND4J, which runs its computations in a native C++
backend while exposing a high-level JVM API (usable from Scala)
to developers.
15
DL4J + Apache Spark
Model Parallelization
Data Parallelization
16
So: What could possibly go wrong?
17
Memory Management
And now, for something (a
little bit) different.
18
Memory Utilization at Training Time
19
Memory Management in DL4J
Memory allocations can be managed using two different
approaches:
×JVM GC and WeakReference tracking
×MemoryWorkspaces
The idea behind both is the same:
once an NDArray is no longer required, the off-heap
memory associated with it should be released so that it
can be reused.
20
Memory Management in DL4J
The difference between the two approaches is:
×JVM GC: when an INDArray is collected by the garbage
collector, its off-heap memory is deallocated, with the
assumption that it is not used elsewhere.
×MemoryWorkspaces: when an INDArray leaves the
workspace scope, its off-heap memory may be reused,
without deallocation and reallocation.
21
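A minimal sketch of the workspace approach described above, assuming the ND4J workspace API (WorkspaceConfiguration, Nd4j.getWorkspaceManager) is on the classpath. The workspace id "SKETCH_WS", the 10 MB initial size and the helper name sumInWorkspace are hypothetical and chosen only for illustration. Arrays created while the workspace is active borrow its pre-allocated off-heap buffer, which is reused the next time the workspace is entered rather than being freed and reallocated.

```scala
import org.nd4j.linalg.api.memory.MemoryWorkspace
import org.nd4j.linalg.api.memory.conf.WorkspaceConfiguration
import org.nd4j.linalg.api.memory.enums.{AllocationPolicy, LearningPolicy}
import org.nd4j.linalg.factory.Nd4j

object WorkspaceSketch {
  // Hypothetical settings: a fixed 10 MB buffer, no size learning.
  private val config: WorkspaceConfiguration = WorkspaceConfiguration.builder()
    .initialSize(10L * 1024L * 1024L)
    .policyAllocation(AllocationPolicy.STRICT)
    .policyLearning(LearningPolicy.NONE)
    .build()

  // Sums the values using an INDArray that lives only inside the workspace scope.
  def sumInWorkspace(values: Array[Double]): Double = {
    val ws: MemoryWorkspace =
      Nd4j.getWorkspaceManager.getAndActivateWorkspace(config, "SKETCH_WS")
    try {
      val arr = Nd4j.create(values)  // allocated from the workspace buffer
      arr.sumNumber().doubleValue()  // copy the result out as a plain JVM value
    } finally {
      ws.close()                     // leaving the scope makes the buffer reusable
    }
  }
}
```

Only a plain Double leaves the scope here; an INDArray that has to outlive the workspace would need to be detached from it first.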
Memory Management in DL4J
Please remember that, when a training process uses
workspaces, periodic GC calls need to be disabled in order
to get the most from this approach:
Nd4j.getMemoryManager.togglePeriodicGc(false)
or their frequency needs to be reduced:
val gcInterval = 10000 // In milliseconds
Nd4j.getMemoryManager.setAutoGcWindow(gcInterval)
22
Spark & the DL4J
Web UI
A love/hate relationship.
23
The DL4J Training UI
24
Root Cause and Potential Solutions
Dependency conflicts between the DL4J-UI library and
Apache Spark when running in the same JVM.
Two alternatives are available:
×Collect and save the relevant training stats at runtime,
and then visualize them offline later.
×Run the UI in a separate JVM (server) and use its remote
functionality. Metrics are uploaded from the
Spark master to the UI server.
25
Serialization & ND4J
Kryo me a river.
26
Data Serialization Options in Spark
Data serialization is the process of converting in-memory
objects to another format that can be used
to store them or send them over the network.
Two options are available in Spark:
×Java (default)
×Kryo
27
OOOps!!!
Kryo doesn’t work well with
off-heap data structures.
28
How to Use Kryo Serialization with ND4J?
×Add the ND4J-Kryo dependency to the project
×Configure the Spark application to use the ND4J Kryo
Registrator:
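import org.apache.spark.SparkConf
val sparkConf = new SparkConf()
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Registrator class name as given on the slide; check it against the ND4J-Kryo version in use
sparkConf.set("spark.kryo.registrator", "org.nd4j.Nd4jRegistrator")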
29
Data Locality
How does data locality affect
performance?
30
Data Locality in Spark
Data locality in Spark means doing computation on the
node where data resides.
In order to optimize processing tasks, Spark tries to place the
execution code as close as possible to the processed data.
×It tries first to move serialized code to the data.
×Sometimes this isn’t possible and the data must be moved to
the executor.
31
Data Locality in Spark
How does Spark handle data locality?
×It prefers to schedule all tasks at the best locality level.
×When there is no unprocessed data on any idle executor, it
switches to lower locality levels.
− It can wait until a busy CPU frees up to start a task on
data on the same server.
− It can immediately start a new task in a farther place that
requires moving data there.
32
Data Locality in Spark
Spark typically waits a bit in the hopes that a busy CPU
frees up. Once that timeout (default is 3 sec) expires, it
starts moving the data to the free CPU. But:
×Training neural networks with DL4J is computationally
expensive.
×So the Spark default behavior isn’t an ideal fit for maximizing
cluster utilization.
33
Data Locality in Spark and DL4J
During training on Spark, DL4J ensures that there is
exactly one task per executor: so it is always better to
immediately transfer data to a free executor, rather than
waiting for another one to become free. Computation
time is more important than any network transfer
time.
34
Data Locality in Spark and DL4J
Spark provides the spark.locality.wait configuration
property: it is the timeout (in seconds) to wait before
moving data to a free CPU.
So, when submitting the configuration for a DL4J training
app, we have to set the value of the spark.locality.wait
property to 0.
35
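As a sketch of the setting above, the property can be placed on the SparkConf before the context is created. The application name "dl4j-training" is a hypothetical placeholder; spark.locality.wait is the property named on the slide.

```scala
import org.apache.spark.SparkConf

object LocalityConfigSketch {
  // "dl4j-training" is a hypothetical application name used only for this sketch.
  val conf: SparkConf = new SparkConf()
    .setAppName("dl4j-training")
    .set("spark.locality.wait", "0") // do not wait for a data-local slot to free up
}
```

The same value can also be passed at submit time with --conf spark.locality.wait=0.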
Handling Java Objects with Large
Off-heap Components
The Off-heap, again.
36
Spark and Large Off-heap Objects
Spark has problems handling Java objects with large
off-heap components, in particular when caching or
persisting them.
When working with DL4J, this is a frequent case, as
DataSet and NDArray objects are involved.
37
Spark and Large Off-heap Objects
Spark drops part of an RDD based on the estimated size of
that block. It estimates the size of a block depending on
the selected persistence level.
In the case of MEMORY_ONLY or
MEMORY_AND_DISK, the estimate is done by
walking the Java object graph.
This process doesn't take into account the off-heap
memory used by DL4J and ND4J, so Spark
underestimates the true size of objects like DataSets or
NDArrays.
38
Spark and Large Off-heap Objects
When deciding between keeping or dropping blocks,
Spark considers only the amount of heap memory used.
DataSet and NDArray objects have a very small on-heap
size, so Spark will keep too many of them, causing
out-of-memory issues as off-heap memory becomes
exhausted.
39
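To give a feel for the gap between the two sizes, here is a small worked calculation in plain Scala. The minibatch shape (32 RGB images of 64x64 pixels, stored as 4-byte floats) and the helper name offHeapBytes are assumptions made for this sketch, not figures from the talk.

```scala
object OffHeapSizeSketch {
  // Bytes needed off-heap for a dense array with the given shape.
  // bytesPerElement defaults to 4, i.e. 32-bit floats (an assumption for this sketch).
  def offHeapBytes(shape: Seq[Long], bytesPerElement: Long = 4L): Long =
    shape.product * bytesPerElement

  def main(args: Array[String]): Unit = {
    // Hypothetical minibatch: 32 images, 3 channels, 64x64 pixels.
    val featureBytes = offHeapBytes(Seq(32L, 3L, 64L, 64L))
    println(s"Off-heap feature buffer: $featureBytes bytes")
  }
}
```

For this shape the buffer is 1,572,864 bytes, roughly 1.5 MB per minibatch, none of which is visible to a size estimate that only walks the on-heap object graph.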
Spark and Large Off-heap Objects
It is therefore good practice to use MEMORY_ONLY_SER or
MEMORY_AND_DISK_SER when persisting an
RDD<DataSet> or an RDD<INDArray>.
This way Spark stores blocks on the JVM heap in
serialized form. Because there is no off-heap memory for
the serialized objects, it can accurately estimate their size,
thus avoiding out-of-memory issues.
40
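A minimal sketch of the recommendation above. The helper name persistSerialized is hypothetical, and it is written generically so that it applies equally to an RDD of DataSet or of INDArray; RDD.persist and StorageLevel.MEMORY_AND_DISK_SER are standard Spark API.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Persists the RDD in serialized form, so that Spark's size estimate
  // reflects the actual bytes held for each block.
  def persistSerialized[T](data: RDD[T]): RDD[T] =
    data.persist(StorageLevel.MEMORY_AND_DISK_SER)
}
```

StorageLevel.MEMORY_ONLY_SER can be substituted when spilling to disk is not wanted.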
Image Pipeline Data
Preparation
One Batch, Two Batch
(Penny and Dime)
41
Convolutional Neural Network (CNN)
42
By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374
Image Pipeline Data Preparation
Single Machine
You can use DataVec’s
ImageRecordReader.
Spark Cluster
Image preprocessing.
43
Image Pipeline Data Preparation
The Spark strategy assumes the images are in subdirectories
named after their class labels. Example:
imageRootDir/car/img0.jpg
imageRootDir/car/img1.jpg
...
imageRootDir/truck/img0.jpg
imageRootDir/truck/img1.jpg
...
imageRootDir/motorbike/img0.jpg
imageRootDir/motorbike/img1.jpg
...
44
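The directory convention above can be illustrated in plain Scala: the label is simply the name of the directory that contains the image. This is only a sketch of the idea, not DataVec's implementation, and the helper name labelFor is hypothetical.

```scala
import java.nio.file.Paths

object ParentDirLabelSketch {
  // Returns the name of the directory that directly contains the image file.
  // Assumes the path has a parent directory; a bare file name would fail here.
  def labelFor(imagePath: String): String =
    Paths.get(imagePath).getParent.getFileName.toString
}
```

With the layout shown, imageRootDir/car/img0.jpg maps to the label car and imageRootDir/truck/img1.jpg to truck.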
Image Pipeline Data Preparation (Spark)
The approach is to preprocess the images into batches of files
(ND4J’s FileBatch objects).
The motivation: the original image files typically use efficient
compression (JPEG, PNG, other) which is much more space
and network efficient than a bitmap representation. However, on
a cluster we want to minimize disk reads due to latency issues
with remote storage – one single file read/transfer is faster than
multiple remote file reads.
45
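The grouping idea can be sketched in plain Scala as follows. It only shows how a list of image paths is cut into fixed-size groups; it does not reproduce the actual FileBatch file format, and the helper name groupIntoBatches is hypothetical.

```scala
object FileBatchGroupingSketch {
  // Splits the paths into consecutive groups of at most batchSize entries.
  // batchSize must be positive; the last group may be smaller than the others.
  def groupIntoBatches(paths: Seq[String], batchSize: Int): List[Seq[String]] =
    paths.grouped(batchSize).toList
}
```

With 70 images and a batch size of 32, this yields three groups of 32, 32 and 6 paths, so the cluster reads three files instead of 70.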
Image Pipeline Data Preparation (Spark)
Step 1 (option 1): Preprocess images locally.
val sourceDirectory = "/home/guglielmo/training_images"
val destinationDirectory = "/home/guglielmo/preprocessed_images"
val batchSize = 32
SparkDataUtils.createFileBatchesLocal(sourceDirectory,
NativeImageLoader.ALLOWED_FORMATS, true, destinationDirectory,
batchSize)
After the preprocessing completes, the destination directory can
be copied to the cluster.
46
Image Pipeline Data Preparation (Spark)
Step 1 (option 2): Preprocess images using Spark.
val sourceDirectory = "hdfs:///data/training_images"
val destinationDirectory = "hdfs:///data/preprocessed_images"
val batchSize = 32
SparkDataUtils.createFileBatchesSpark(sourceDirectory, destinationDirectory,
batchSize, sparkContext)
47
Image Pipeline Data Preparation (Spark)
Step 2: Create a data loader...
val imageHeightWidth = 64 // image height and width, in pixels
val imageChannels = 3     // RGB
// Labels are taken from the name of each image's parent directory
val labelMaker: PathLabelGenerator = new ParentPathLabelGenerator
val rr = new ImageRecordReader(imageHeightWidth, imageHeightWidth, imageChannels, labelMaker)
rr.setLabels(new TinyImageNetDataSetIterator(1).getLabels)
val numClasses = TinyImageNetFetcher.NUM_LABELS
// minibatch (the minibatch size) must be defined beforehand
val loader = new RecordReaderFileBatchLoader(rr, minibatch, 1, numClasses)
loader.setPreProcessor(new ImagePreProcessingScaler) // scales pixel values to the 0..1 range
48
Image Pipeline Data Preparation (Spark)
Step 2: ...and finally train the model
val trainDataPath = "hdfs:///data/preprocessed_images"
val pathsTrain: JavaRDD[String] = SparkUtils.listPaths(sparkContext, trainDataPath)
for (i <- 0 until numEpochs) {
  sparkNet.fitPaths(pathsTrain, loader)
}
49
All the Details on DL4J and Spark in my
Book
http://tinyurl.com/y9jkvtuy
50
Thanks!
Any questions?
You can find me at
@guglielmoiozzia
https://ie.linkedin.com/in/giozzia
googlielmo.blogspot.com
51
Credits
Special thanks to all the people who made and
released these awesome resources for free:
×Presentation template by SlidesCarnival
52

More Related Content

What's hot

The Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral ProcessingThe Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral ProcessingDataWorks Summit
 
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...Databricks
 
Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud? Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud? DataWorks Summit
 
The Future of Computing is Distributed
The Future of Computing is DistributedThe Future of Computing is Distributed
The Future of Computing is DistributedAlluxio, Inc.
 
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Databricks
 
Data Gloveboxes: A Philosophy of Data Science Data Security
Data Gloveboxes: A Philosophy of Data Science Data SecurityData Gloveboxes: A Philosophy of Data Science Data Security
Data Gloveboxes: A Philosophy of Data Science Data SecurityDataWorks Summit
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezJan Pieter Posthuma
 
Performance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storagePerformance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storageDataWorks Summit
 
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerLogical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerDataWorks Summit
 
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...DataStax
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkCloudera, Inc.
 
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersHadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersDataWorks Summit/Hadoop Summit
 
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...DataStax
 
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームPivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームMasayuki Matsushita
 
dplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale Datadplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale DataCloudera, Inc.
 
Securing data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache RangerSecuring data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache RangerDataWorks Summit
 
Light-weighted HDFS disaster recovery
Light-weighted HDFS disaster recoveryLight-weighted HDFS disaster recovery
Light-weighted HDFS disaster recoveryDataWorks Summit
 

What's hot (20)

The Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral ProcessingThe Unbearable Lightness of Ephemeral Processing
The Unbearable Lightness of Ephemeral Processing
 
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
 
Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud? Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud?
 
The Future of Computing is Distributed
The Future of Computing is DistributedThe Future of Computing is Distributed
The Future of Computing is Distributed
 
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
Reimagining Devon Energy’s Data Estate with a Unified Approach to Integration...
 
Data Gloveboxes: A Philosophy of Data Science Data Security
Data Gloveboxes: A Philosophy of Data Science Data SecurityData Gloveboxes: A Philosophy of Data Science Data Security
Data Gloveboxes: A Philosophy of Data Science Data Security
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
 
Performance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storagePerformance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storage
 
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerLogical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
 
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...
DataStax | Distributing the Enterprise, Safely (Thomas Valley) | Cassandra Su...
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersHadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
 
Scheduling Policies in YARN
Scheduling Policies in YARNScheduling Policies in YARN
Scheduling Policies in YARN
 
To The Cloud and Back: A Look At Hybrid Analytics
To The Cloud and Back: A Look At Hybrid AnalyticsTo The Cloud and Back: A Look At Hybrid Analytics
To The Cloud and Back: A Look At Hybrid Analytics
 
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
 
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームPivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
 
dplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale Datadplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale Data
 
Securing data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache RangerSecuring data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache Ranger
 
Light-weighted HDFS disaster recovery
Light-weighted HDFS disaster recoveryLight-weighted HDFS disaster recovery
Light-weighted HDFS disaster recovery
 
Hadoop Platform at Yahoo
Hadoop Platform at YahooHadoop Platform at Yahoo
Hadoop Platform at Yahoo
 

Similar to Deep Learning with DL4J on Apache Spark: Yeah it's Cool

Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Databricks
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdfMaheshPandit16
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Databricks
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch ProcessingEdureka!
 
Deep learning on a mixed cluster with deeplearning4j and spark
Deep learning on a mixed cluster with deeplearning4j and sparkDeep learning on a mixed cluster with deeplearning4j and spark
Deep learning on a mixed cluster with deeplearning4j and sparkFrançois Garillot
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Databricks
 
Advanced deeplearning4j features
Advanced deeplearning4j featuresAdvanced deeplearning4j features
Advanced deeplearning4j featuresAdam Gibson
 
BigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkBigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkDESMOND YUEN
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkProject Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkDatabricks
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Jyotasana Bharti
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARKBig Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARKMatt Stubbs
 
Machine Learning with Scala
Machine Learning with ScalaMachine Learning with Scala
Machine Learning with ScalaSusan Eraly
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Ahsan Javed Awan
 

Similar to Deep Learning with DL4J on Apache Spark: Yeah it's Cool (20)

Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
 
Spark tutorial
Spark tutorialSpark tutorial
Spark tutorial
 
Deep learning on a mixed cluster with deeplearning4j and spark
Deep learning on a mixed cluster with deeplearning4j and sparkDeep learning on a mixed cluster with deeplearning4j and spark
Deep learning on a mixed cluster with deeplearning4j and spark
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
 
Advanced deeplearning4j features
Advanced deeplearning4j featuresAdvanced deeplearning4j features
Advanced deeplearning4j features
 
BigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkBigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for Spark
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkProject Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)
 
Spark 101
Spark 101Spark 101
Spark 101
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARKBig Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
 
Machine Learning with Scala
Machine Learning with ScalaMachine Learning with Scala
Machine Learning with Scala
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Deep Learning with DL4J on Apache Spark: Yeah it's Cool

  • 1. Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it the Right Way?
  • 2. Hello! I am Guglielmo Iozzia Associate Director – Business Tech Analysis at Previously at
  • 3. MSD in Ireland + 50 years Approx. 2,000 employees Five sites: Ballydine, Brinny, Carlow and Dublin $2.5 billion investment to date Approx 50% MSD’s top 20 products manufactured here Export to + 60 countries €6.1 billion turnover in 2017 2017 + 300 jobs & €280m investment MSD Biotech, Dublin, coming in 2021
  • 4. Deep Learning: It is a subset of machine learning where artificial neural networks, algorithms inspired by the human brain, learn from large amounts of data.
  • 5. Some Practical Applications of Deep Learning × Computer vision × Text generation × NLP and NLU × Autonomous cars × Robotics × Gaming × Quantitative finance
  • 6. DL4J: It is an Open Source, distributed, Deep Learning framework written for JVM languages.
  • 7. It is integrated with Hadoop and Apache Spark and can be used on distributed GPUs and CPUs.
  • 8. DL4J Modules × DataVec × Arbiter × NN × Datasets × RL4J × DL4J-Spark × ND4J
  • 9. DL4J Code Example: Network Configuration; Training and Evaluation
  • 10. ND4J: It is an Open Source linear algebra and matrix manipulation library which supports n-dimensional arrays and is integrated with Apache Hadoop and Spark.
  • 11. Apache Spark: It is a unified analytics engine for large-scale data processing.
  • 12. Apache Spark. Speed: Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine. Ease of Use: Write applications quickly in Java, Scala, Python, R and SQL.
  • 13. Apache Spark. Generality: Spark provides a stack of libraries that can be combined seamlessly in the same application. Runs Everywhere: Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone or in the cloud. It can access diverse data sources.
  • 14. Why Distributed MNN Training with DL4J and Apache Spark? Why is this a powerful combination?
  • 15. DL4J + Apache Spark × DL4J provides a high-level API to design, configure, train and evaluate MNNs. × Spark's performance is excellent, in particular for ETL/streaming, but in an MNN training context some data transformations/aggregations need to be done in a low-level language. × DL4J uses ND4J, which is a C++ library that provides a high-level Scala API to developers.
  • 16. DL4J + Apache Spark: Model Parallelization; Data Parallelization
  • 17. So: What could possibly go wrong?
  • 18. Memory Management: And now, for something (a little bit) different.
  • 19. Memory Utilization at Training Time
  • 20. Memory Management in DL4J: Memory allocations can be managed using two different approaches: × JVM GC and WeakReference tracking × MemoryWorkspaces The idea behind both is the same: once an NDArray is no longer required, the off-heap memory associated with it should be released so that it can be reused.
  • 21. Memory Management in DL4J: The difference between the two approaches is: × JVM GC: when an INDArray is collected by the garbage collector, its off-heap memory is deallocated, with the assumption that it is not used elsewhere. × MemoryWorkspaces: when an INDArray leaves the workspace scope, its off-heap memory may be reused, without deallocation and reallocation.
  • 22. Memory Management in DL4J: Please remember that, when a training process uses workspaces, in order to get the most from this approach, periodic GC calls need to be disabled: Nd4j.getMemoryManager.togglePeriodicGc(false) or their frequency needs to be reduced: val gcInterval = 10000 // In milliseconds Nd4j.getMemoryManager.setAutoGcWindow(gcInterval)
  • 23. Spark & the DL4J Web UI: A love/hate relationship.
  • 25. Root Cause and Potential Solutions: Dependency conflicts between the DL4J-UI library and Apache Spark when running in the same JVM. Two alternatives are available: × Collect and save the relevant training stats at runtime, and then visualize them offline later. × Run the UI in a separate JVM (server) and use its remote functionality: metrics are uploaded from the Spark master to the UI server.
  • 26. Serialization & ND4J: Kryo me a river.
  • 27. Data Serialization Options in Spark: Data serialization is the process of converting in-memory objects to another format that can be used to store them or send them over the network. Two options are available in Spark: × Java (default) × Kryo
  • 28. OOOps!!! Kryo doesn’t work well with off-heap data structures.
  • 29. How to Use Kryo Serialization with ND4J? × Add the ND4J-Kryo dependency to the project × Configure the Spark application to use the ND4J Kryo Registrator: val sparkConf = new SparkConf sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") sparkConf.set("spark.kryo.registrator", "org.nd4j.Nd4jRegistrator")
  • 30. Data Locality: How does data locality affect performance?
  • 31. Data Locality in Spark: Data locality in Spark means doing computation on the node where the data resides. In order to optimize processing tasks, Spark tries to place the execution code as close as possible to the processed data. × It tries first to move serialized code to the data. × Sometimes this isn’t possible and the data must be moved to the executor.
  • 32. Data Locality in Spark: How does Spark handle data locality? × It prefers to schedule all tasks at the best locality level. × When there is no unprocessed data on any idle executor, it switches to lower locality levels. − It can wait until a busy CPU frees up to start a task on data on the same server. − It can immediately start a new task in a farther place that requires moving data there.
  • 33. Data Locality in Spark: Spark typically waits a bit in the hope that a busy CPU frees up. Once that timeout (default is 3 sec) expires, it starts moving the data to the free CPU. But: × Training neural networks with DL4J is computationally expensive. × So the Spark default behavior isn’t an ideal fit for maximizing cluster utilization.
  • 34. Data Locality in Spark and DL4J: During training on Spark, DL4J ensures that there is exactly one task per executor, so it is always better to immediately transfer data to a free executor, rather than waiting for another one to become free. Computation time is more important than any network transfer time.
  • 35. Data Locality in Spark and DL4J: Spark provides the spark.locality.wait configuration property: it is the timeout (in seconds) to wait before moving data to a free CPU. So, when submitting the configuration for a DL4J training app, we have to set the value of the spark.locality.wait property to 0.
  • 36. Handling Java Objects with Large Off-heap Components: The Off-heap, again.
  • 37. Spark and Large Off-heap Objects: Spark has problems handling Java objects with large off-heap components, in particular when caching or persisting them. When working with DL4J, this is a frequent case, as DataSet and NDArray objects are involved.
  • 38. Spark and Large Off-heap Objects: Spark drops part of an RDD based on the estimated size of that block. It estimates the size of a block depending on the selected persistence level. In the case of MEMORY_ONLY or MEMORY_AND_DISK, the estimate is done by walking the Java object graph. This process doesn't take into account the off-heap memory used by DL4J and ND4J, so Spark underestimates the true size of objects like DataSets or NDArrays.
  • 39. Spark and Large Off-heap Objects: When deciding between keeping or dropping blocks, Spark considers only the amount of heap memory used. DataSet and NDArray objects have a very small on-heap size, so Spark will keep too many of them, causing out of memory issues as off-heap memory becomes exhausted.
  • 40. Spark and Large Off-heap Objects: It is therefore good practice to use MEMORY_ONLY_SER or MEMORY_AND_DISK_SER when persisting an RDD<DataSet> or an RDD<INDArray>. This way Spark stores blocks on the JVM heap in serialized form. Because there is no off-heap memory for the serialized objects, it can accurately estimate their size, thus avoiding out of memory issues.
  • 41. Image Pipeline Data Preparation: One Batch, Two Batch (Penny and Dime)
  • 42. Convolutional Neural Network (CNN). By Aphex34 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45679374
  • 43. Image Pipeline Data Preparation. Single Machine: You can use DataVec’s ImageRecordReader. Spark Cluster: Image preprocessing.
  • 44. Image Pipeline Data Preparation: The Spark strategy assumes the images are in subdirectories based on their class labels. Example: imageRootDir/car/img0.jpg imageRootDir/car/img1.jpg ... imageRootDir/truck/img0.jpg imageRootDir/truck/img1.jpg ... imageRootDir/motorbike/img0.jpg imageRootDir/motorbike/img1.jpg ...
  • 45. Image Pipeline Data Preparation (Spark): The approach is to preprocess the images into batches of files (ND4J’s FileBatch objects). The motivation: the original image files typically use efficient compression (JPEG, PNG, other) which is much more space and network efficient than a bitmap representation. However, on a cluster we want to minimize disk reads due to latency issues with remote storage – one single file read/transfer is faster than multiple remote file reads.
  • 46. Image Pipeline Data Preparation (Spark) Step 1 (option 1): Preprocess images locally. val sourceDirectory = "/home/guglielmo/training_images" val destinationDirectory = "/home/guglielmo/preprocessed_images" val batchSize = 32 SparkDataUtils.createFileBatchesLocal(sourceDirectory, NativeImageLoader.ALLOWED_FORMATS, true, destinationDirectory, batchSize) After the preprocessing completes, the destination directory can be copied to the cluster.
  • 47. Image Pipeline Data Preparation (Spark) Step 1 (option 2): Preprocess images using Spark. val sourceDirectory = "hdfs:///data/training_images" val destinationDirectory = "hdfs:///data/preprocessed_images" val batchSize = 32 SparkDataUtils.createFileBatchesSpark(sourceDirectory, destinationDirectory, batchSize, sparkContext)
  • 48. Image Pipeline Data Preparation (Spark) Step 2: Create a data loader... val imageHeightWidth = 64 val imageChannels = 3 val labelMaker:PathLabelGenerator = new ParentPathLabelGenerator val rr = new ImageRecordReader(imageHeightWidth, imageHeightWidth, imageChannels, labelMaker) rr.setLabels(new TinyImageNetDataSetIterator(1).getLabels) val numClasses = TinyImageNetFetcher.NUM_LABELS val loader = new RecordReaderFileBatchLoader(rr, minibatch, 1, numClasses) loader.setPreProcessor(new ImagePreProcessingScaler)
  • 49. Image Pipeline Data Preparation (Spark) Step 2: ...and finally train the model. val trainDataPath = "hdfs:///data/preprocessed_images" val pathsTrain:JavaRDD[String] = SparkUtils.listPaths(sparkContext, trainDataPath) for (i <- 0 until numEpochs) { sparkNet.fitPaths(pathsTrain, loader) }
  • 50. All the Details on DL4J and Spark in my Book http://tinyurl.com/y9jkvtuy
  • 51. Thanks! Any questions? You can find me at @guglielmoiozzia https://ie.linkedin.com/in/giozzia googlielmo.blogspot.com
  • 52. Credits: Special thanks to all the people who made and released these awesome resources for free: × Presentation template by SlidesCarnival
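
The Spark-side settings from slides 29, 35 and 40 can be gathered in one place. What follows is only a minimal sketch, not the author's code: the object name Dl4jSparkSetup and the method names buildConf and persistSerialized are hypothetical, the registrator class name is copied as it appears on slide 29, and it assumes spark-core (plus the ND4J-Kryo dependency at runtime) is on the classpath.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper object; its name and method names are illustrative only.
object Dl4jSparkSetup {

  // Builds a SparkConf carrying the settings from slides 29 and 35.
  def buildConf(appName: String): SparkConf =
    new SparkConf()
      .setAppName(appName)
      // Kryo serialization with the ND4J registrator
      // (class name as given on slide 29; needs the ND4J-Kryo dependency).
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "org.nd4j.Nd4jRegistrator")
      // Do not wait for a data-local slot: move the data to a free executor at once.
      .set("spark.locality.wait", "0")

  // Persists an RDD in serialized form (slide 40), so that Spark sizes blocks
  // from their serialized bytes rather than from the small on-heap footprint.
  def persistSerialized[T](data: RDD[T]): RDD[T] =
    data.persist(StorageLevel.MEMORY_AND_DISK_SER)
}
```

With these helpers, a training job would build its context from Dl4jSparkSetup.buildConf("training") and wrap any RDD of DataSet objects in persistSerialized before reusing it across epochs. Keeping the three settings in one helper is a design convenience only; the same values could equally be passed on the spark-submit command line.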

Editor's Notes

  1. The field of artificial intelligence is essentially when machines can do tasks that typically require human intelligence. It encompasses machine learning, where machines can learn by experience and acquire skills without human involvement. Deep learning is a subset of machine learning where artificial neural networks, algorithms inspired by the human brain, learn from large amounts of data. Similarly to how we learn from experience, the deep learning algorithm would perform a task repeatedly, each time tweaking it a little to improve the outcome. We refer to ‘deep learning’ because the neural networks have various (deep) layers that enable learning. Just about any problem that requires “thought” to figure out is a problem deep learning can learn to solve.
  2. http://www.yaronhadad.com/deep-learning-most-amazing-applications/