SlideShare a Scribd company logo
1 of 45
Download to read offline
ENTERPRISE-SCALE
TOPOLOGICAL DATA ANALYSIS
USING SPARK
Anshuman Mishra, Lawrence Spracklen
Alpine Data
Alpine Data
What we’ll talk about
•  What’s TDA and why should you care
•  Deep dive into Mapper and bottlenecks
•  Betti Mapper - scaling Mapper to the enterprise
Can anyone recognize this?
We built the first open-source scalable
implementation of TDA Mapper
•  Our implementation of Mapper beats a naïve
version on Spark by 8x-11x* for moderate to large
datasets
•  8x: avg. 305 s for Betti vs. non-completion in 2400 s for
Naïve (100,000 x 784 dataset)
•  11x: avg. 45 s for Betti vs. 511 s for Naïve (10,000 x 784
dataset)
•  We used a novel combination of locality-sensitive
hashing on Spark to increase performance
TDA AND MAPPER: WHY SHOULD
WE CARE?
Conventional ML carries the “curse
of dimensionality”
•  As d à∞, all data points are packed away into
corners of a corresponding d-dimensional
hypercube, with little to separate them
•  Instance learners start to choke
•  Detecting anomalies becomes tougher
How does TDA (Mapper) help?
•  “Topological Methods for the Analysis of High Dimensional
Data Sets and 3D Object Recognition”, G. Singh, F. Memoli, G.
Carlsson, Eurographics Symposium on Point-Based Graphics
(2007)
•  Algorithm consumes a dataset and generates a
topological summary of the whole dataset
•  Summary can help identify localized structures in
high-dimensional data
Some examples of Mapper outputs
DEEP DIVE INTO MAPPER
Mapper: The 30,000 ft. view
M x M
distance
matrix
M x N
Mx1
Mx1
.
. .
.. .
.
.
.
. ..
. …
M x M
distance
matrix
Mapper: 1. Choose a Distance Metric
M x N
The 1st step is to choose a distance metric
for the dataset, in order to compute a distance
matrix.
This will be used to capture similarity between
data points.
Some examples of distance metrics are
Euclidean, Hamming, cosine, etc.
M x M
distance
matrix
Mapper: 2. Compute filter functions
M x N
Next, filter functions (aka lenses) are chosen
to map data points to a single value on the real
line.
These filter functions can be based on:
-  Raw features
-  Statistics – mean, median, variance, etc.
-  Geometry – distance to closest data point,
furthest data point, etc.
-  ML algorithm outputs
Usually two such functions are computed on the
dataset.
M x 1
M x 1
M x M
distance
matrix
Mapper: 3. Apply cover & overlap
M x N
M x 1
M x 1
…
Next, the ranges of each filter application are
“chopped up” into overlapping segments or
intervals using two parameters: cover and
overlap
-  Cover (aka resolution) controls how many
intervals each filter range will be chopped
into, e.g. 40,100
-  Overlap controls the degree of overlap
between intervals (e.g. 20%)
Cover
Overlap
M x M
distance
matrix
Mapper: 4. Compute Cartesians
M x N
The next step is to compute the Cartesian
products of the range intervals (from the
previous step) and assign the original data
points to the resulting two-dimensional regions
based on their filter values.
Note that these two-dimensional regions will
overlap due to the parameters set in the
previous step.
In other words, there will be points in common
between these regions.
M x 2
…
M x M
distance
matrix
Mapper: 5. Perform clustering
M x N
The penultimate stage in the Mapper algorithm
is to perform clustering in the original high-
dimensional space for each (overlapping)
region.
Each cluster will be represented by a node;
since regions overlap, some clusters will have
points in common. Their corresponding
nodes will be connected via an unweighted
edge.
The kind of clustering performed is immaterial.
Our implementation uses DBSCAN.
M x 2
…
.
. .
.. .
..
.
. ..
.
Mapper: 6. Build TDA network
Finally, by joining nodes in topological space
(re: clusters in feature space) that have points
in common, one can derive a topological
network in the form of a graph.
Graph coloring can be performed to capture
localized behavior in the dataset and derive
hidden insights from the data.
Open source Mapper implementations
•  Python:
–  Python Mapper, Mullner and Babu: http://danifold.net/mapper/
–  Proof-of-concept Mapper in a Kaggle notebook, @mlwave:
https://www.kaggle.com/triskelion/digit-recognizer/mapping-digits-with-a-t-sne-
lens/notebook
•  R:
–  TDAmapper package
•  Matlab:
–  Original mapper implementation
Alpine TDA
R
Python
Mapper: Computationally expensive!
M x M
distance
matrix
M x N
Mx1
Mx1
.
. .
.. .
.
.
.
. ..
. …
O(N2) is prohibitive
for large datasets
Single-node open source Mappers choke on large datasets (generously
defined as > 10k data points with >100 columns)
Rolling our own Mapper..
•  Our Mapper implementation
–  Built on PySpark 1.6.1
–  Called Betti Mapper
–  Named after Enrico Betti, a famous topologist
X ✔
Multiple ways to scale Mapper
1.  Naïve Spark implementation
ü  Write the Mapper algorithm using (Py)Spark RDDs
–  Distance matrix computation still performed over entire dataset on
driver node
2.  Down-sampling / landmarking (+ Naïve Spark)
ü  Obtain manageable number of samples from dataset
–  Unreasonable to assume global distribution profiles are captured
by samples
3.  LSH Prototyping!!!!?!
What came first?
•  We use Mapper to detect structure in high-dimensional data using the
concept of similarity.
•  BUT we need to measure similarity so we can sample efficiently.
•  We could use stratified sampling, but then what about
•  Unlabeled data?
•  Anomalies and outliers?
•  LSH is a lower-cost first pass capturing similarity for cheap and helping to
scale Mapper
Locality sensitive hashing by
random projection
•  We draw random vectors with same
dimensions as dataset and compute dot
products with each data point
•  If dot product > 0, mark as 1, else 0
•  Random vectors serve to slice feature
space into bins
•  Series of projection bits can be converted
into a single hash number
•  We have found good results by setting # of
random vectors to: floor(log2 |M|)
1
1
1
0
0
0
…
…
Scaling with LSH Prototyping on Spark
1.  Use Locality Sensitive Hashing
(SimHash / Random Projection)
to drop data points into bins
2.  Compute “prototype” points for
each bin corresponding to bin
centroid
–  can also use median to make
prototyping more robust
3.  Use binning information to
compute topological network:
distMxM => distBxB, where B is no. of
prototype points (1 per bin)
ü Fastest scalable implementation
ü # of random vectors controls #
of bins and therefore fidelity of
topological representation
ü LSH binning tends to select
similar points (inter-bin distance >
intra-bin distance)
Betti Mapper
B x B
Distance
Matrix
of prototypes
M x N
Bx1
Bx1
.
. .
.. .
.
.
.
. ..
. …
LSH
M x M
Distance
Matrix:
D(p1, p2) =
D(bin(p1), bin(p2))
Mx1
Mx1
B x N
prototypes
X X
IMPLEMENTATION
PERFORMANCE
Using pyspark
•  Simple to “sparkify” an existing python mapper
implementation
•  Leverage the rich python ML support to greatest
extent
–  Modify only the computational bottlenecks
•  Numpy/Scipy is essential
•  Turnkey Anaconda deployment on CDH
Naïve performanceTime(Years)
•  4TFLOP/s GPGPU (100% util)
•  5K Columns
•  Euclidean distance
Seconds
Decades
Row count
0.000
0.000
0.000
0.000
0.000
0.001
0.010
0.100
1.000
10.000
100.000
1000.000
10K 100K 1M 10M 100M 1B
Our Approach
Build and test three implementations of Mapper
1.  Naïve Mapper on Spark
2.  Mapper on Spark with sampling (5%, 10%, 25%)
3.  Betti Mapper: LSH + Mapper (8v, 12v, 16v)
Test Hardware
Macbook Pro, mid 2014
•  2.5 GHz Intel® Core i7
•  16 GB 1600 MHz DDR3
•  512 GB SSD
Spark Cluster on Amazon EC2
•  Instance type: r3.large
•  Node: 2 vCPU, 15 GB RAM, 32 GB SSD
•  4 workers, 1 driver
•  250 GB SSD EBS as persistent HDFS
•  Amazon Linux, Anaconda 64-bit 4.0.0,
PySpark 1.6.1
Spark Configuration
•  --driver-memory 8g
•  --executor-memory 12g (each)
•  --executor-cores 2
•  No. of executors: 4
Dataset Configuration
Filename Size (MxN) Size (bytes)
MNIST_1k.csv 1000 rows x 784 cols 1.83 MB
MNIST_10k.csv 10,000 rows x 784 cols 18.3 MB
MNIST_100k.csv 100,000 rows x 784 cols 183 MB
MNIST_1000k.csv 1,000,000 rows x 784 cols 1830 MB
The datasets are sampled with replacement from the
original MNIST dataset available for download using
Python’s scikit-learn library (mldata module)
Test Harness
•  Runs test cases on cluster
•  Test case:
–  <mapper type, dataset size, no. of vectors>
•  Terminates when runtime exceeds 40 minutes
Some DAG Snapshots
Graph coloring by median digitClustering and node assignment
X X X 40minutes
Future Work
•  Test other LSH schemes
•  Optimize Spark code and leverage existing
codebases for distributed linear algebra routines
•  Incorporate as a machine learning model on the
Alpine Data platform
Alpine Spark TDA
TDA
Key Takeaways
•  Scaling Mapper algorithm is non-trivial but
possible
•  Gaining control over fidelity of representation is
key to gaining insights from data
•  Open source implementation of Betti Mapper will
be made available after code cleanup! J
References
•  “Topological Methods for the Analysis of High Dimensional Data Sets and 3D
Object Recognition”, G. Singh, F. Memoli, G. Carlsson, Eurographics Symposium
on Point-Based Graphics (2007)
•  “Extracting insights from the shape of complex data using topology”, P. Y. Lum,
G. Singh, A. Lehman, T. Ishkanov, M. Vejdemo-Johansson, M. Alagappan, J.
Carlsson, G. Carlsson, Nature Scientific Reports (2013)
•  “Online generation of locality sensitive hash signatures”, B. V. Durme, A. Lall,
Proceedings of the Association of Computational Linguistics 2010 Conference Short
Papers (2010)
•  PySpark documentation: http://spark.apache.org/docs/latest/api/python/
Acknowledgements
•  Rachel Warren
•  Anya Bida
Alpine is Hiring
•  Platform engineers
•  UX engineers
•  Build engineers
•  Ping me : lawrence@alpinenow.com
Q & (HOPEFULLY) A
THANK YOU.
anshuman@alpinenow.com
lawrence@alpinenow.com

More Related Content

What's hot

Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Spark Summit
 
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit
 
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovA Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovSpark Summit
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkJen Aman
 
Best Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaBest Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaDatabricks
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Databricks
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresSpark Summit
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Spark Summit
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkSpark Summit
 
Harnessing Big Data with Spark
Harnessing Big Data with SparkHarnessing Big Data with Spark
Harnessing Big Data with SparkAlpine Data
 
Hadoop Scheduling - a 7 year perspective
Hadoop Scheduling - a 7 year perspectiveHadoop Scheduling - a 7 year perspective
Hadoop Scheduling - a 7 year perspectiveJoydeep Sen Sarma
 
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...Databricks
 
Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters
Downscaling: The Achilles heel of Autoscaling Apache Spark ClustersDownscaling: The Achilles heel of Autoscaling Apache Spark Clusters
Downscaling: The Achilles heel of Autoscaling Apache Spark ClustersDatabricks
 
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDistributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDatabricks
 
Spark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan ZvaraSpark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan ZvaraSpark Summit
 
Scalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
Scalable Acceleration of XGBoost Training on Apache Spark GPU ClustersScalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
Scalable Acceleration of XGBoost Training on Apache Spark GPU ClustersDatabricks
 
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16MLconf
 
Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark
 Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark
Which Is Deeper - Comparison Of Deep Learning Frameworks On SparkSpark Summit
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web ServiceEvan Chan
 

What's hot (20)

Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
 
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik Sivashanmugam
 
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovA Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using Spark
 
Best Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaBest Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and Delta
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi Torres
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On Spark
 
Harnessing Big Data with Spark
Harnessing Big Data with SparkHarnessing Big Data with Spark
Harnessing Big Data with Spark
 
Hadoop Scheduling - a 7 year perspective
Hadoop Scheduling - a 7 year perspectiveHadoop Scheduling - a 7 year perspective
Hadoop Scheduling - a 7 year perspective
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
 
Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters
Downscaling: The Achilles heel of Autoscaling Apache Spark ClustersDownscaling: The Achilles heel of Autoscaling Apache Spark Clusters
Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters
 
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDistributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
 
Spark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan ZvaraSpark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan Zvara
 
Scalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
Scalable Acceleration of XGBoost Training on Apache Spark GPU ClustersScalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
Scalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
 
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
 
Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark
 Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark
Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
 

Similar to Enterprise Scale Topological Data Analysis Using Spark

Target Holding - Big Dikes and Big Data
Target Holding - Big Dikes and Big DataTarget Holding - Big Dikes and Big Data
Target Holding - Big Dikes and Big DataFrens Jan Rumph
 
Expressing and Exploiting Multi-Dimensional Locality in DASH
Expressing and Exploiting Multi-Dimensional Locality in DASHExpressing and Exploiting Multi-Dimensional Locality in DASH
Expressing and Exploiting Multi-Dimensional Locality in DASHMenlo Systems GmbH
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLMLconf
 
A Scaleable Implemenation of Deep Leaning on Spark- Alexander Ulanov
A Scaleable Implemenation of Deep Leaning on Spark- Alexander UlanovA Scaleable Implemenation of Deep Leaning on Spark- Alexander Ulanov
A Scaleable Implemenation of Deep Leaning on Spark- Alexander UlanovSpark Summit
 
A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelJenny Liu
 
House price prediction
House price predictionHouse price prediction
House price predictionSabahBegum
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
Hand Written Digit Classification
Hand Written Digit ClassificationHand Written Digit Classification
Hand Written Digit Classificationijtsrd
 
Distributed approximate spectral clustering for large scale datasets
Distributed approximate spectral clustering for large scale datasetsDistributed approximate spectral clustering for large scale datasets
Distributed approximate spectral clustering for large scale datasetsBita Kazemi
 
Real time streaming analytics
Real time streaming analyticsReal time streaming analytics
Real time streaming analyticsAnirudh
 
Matrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsMatrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsYONG ZHENG
 
Optimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data PerspectiveOptimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data Perspectiveপল্লব রায়
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012MapR Technologies
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Zihui Li
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)byteLAKE
 
DeepLearningProjV3
DeepLearningProjV3DeepLearningProjV3
DeepLearningProjV3Ana Sanchez
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable PythonTravis Oliphant
 
Automated Data Exploration: Building efficient analysis pipelines with Dask
Automated Data Exploration: Building efficient analysis pipelines with DaskAutomated Data Exploration: Building efficient analysis pipelines with Dask
Automated Data Exploration: Building efficient analysis pipelines with DaskASI Data Science
 

Similar to Enterprise Scale Topological Data Analysis Using Spark (20)

Target Holding - Big Dikes and Big Data
Target Holding - Big Dikes and Big DataTarget Holding - Big Dikes and Big Data
Target Holding - Big Dikes and Big Data
 
Expressing and Exploiting Multi-Dimensional Locality in DASH
Expressing and Exploiting Multi-Dimensional Locality in DASHExpressing and Exploiting Multi-Dimensional Locality in DASH
Expressing and Exploiting Multi-Dimensional Locality in DASH
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
 
A Scaleable Implemenation of Deep Leaning on Spark- Alexander Ulanov
A Scaleable Implemenation of Deep Leaning on Spark- Alexander UlanovA Scaleable Implemenation of Deep Leaning on Spark- Alexander Ulanov
A Scaleable Implemenation of Deep Leaning on Spark- Alexander Ulanov
 
A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in Parallel
 
House price prediction
House price predictionHouse price prediction
House price prediction
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
Hand Written Digit Classification
Hand Written Digit ClassificationHand Written Digit Classification
Hand Written Digit Classification
 
Distributed approximate spectral clustering for large scale datasets
Distributed approximate spectral clustering for large scale datasetsDistributed approximate spectral clustering for large scale datasets
Distributed approximate spectral clustering for large scale datasets
 
Real time streaming analytics
Real time streaming analyticsReal time streaming analytics
Real time streaming analytics
 
Matrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsMatrix Factorization In Recommender Systems
Matrix Factorization In Recommender Systems
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
Space time & power.
Space time & power.Space time & power.
Space time & power.
 
Optimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data PerspectiveOptimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data Perspective
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
 
DeepLearningProjV3
DeepLearningProjV3DeepLearningProjV3
DeepLearningProjV3
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
 
Automated Data Exploration: Building efficient analysis pipelines with Dask
Automated Data Exploration: Building efficient analysis pipelines with DaskAutomated Data Exploration: Building efficient analysis pipelines with Dask
Automated Data Exploration: Building efficient analysis pipelines with Dask
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 

Recently uploaded (20)

Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 

Enterprise Scale Topological Data Analysis Using Spark

  • 1. ENTERPRISE-SCALE TOPOLOGICAL DATA ANALYSIS USING SPARK Anshuman Mishra, Lawrence Spracklen Alpine Data
  • 3. What we’ll talk about •  What’s TDA and why should you care •  Deep dive into Mapper and bottlenecks •  Betti Mapper - scaling Mapper to the enterprise
  • 5. We built the first open-source scalable implementation of TDA Mapper •  Our implementation of Mapper beats a naïve version on Spark by 8x-11x* for moderate to large datasets •  8x: avg. 305 s for Betti vs. non-completion in 2400 s for Naïve (100,000 x 784 dataset) •  11x: avg. 45 s for Betti vs. 511 s for Naïve (10,000 x 784 dataset) •  We used a novel combination of locality-sensitive hashing on Spark to increase performance
  • 6. TDA AND MAPPER: WHY SHOULD WE CARE?
  • 7. Conventional ML carries the “curse of dimensionality” •  As d à∞, all data points are packed away into corners of a corresponding d-dimensional hypercube, with little to separate them •  Instance learners start to choke •  Detecting anomalies becomes tougher
  • 8. How does TDA (Mapper) help? •  “Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition”, G. Singh, F. Memoli, G. Carlsson, Eurographics Symposium on Point-Based Graphics (2007) •  Algorithm consumes a dataset and generates a topological summary of the whole dataset •  Summary can help identify localized structures in high-dimensional data
  • 9. Some examples of Mapper outputs
  • 10. DEEP DIVE INTO MAPPER
  • 11. Mapper: The 30,000 ft. view M x M distance matrix M x N Mx1 Mx1 . . . .. . . . . . .. . …
  • 12. M x M distance matrix Mapper: 1. Choose a Distance Metric M x N The 1st step is to choose a distance metric for the dataset, in order to compute a distance matrix. This will be used to capture similarity between data points. Some examples of distance metrics are Euclidean, Hamming, cosine, etc.
  • 13. M x M distance matrix Mapper: 2. Compute filter functions M x N Next, filter functions (aka lenses) are chosen to map data points to a single value on the real line. These filter functions can be based on: -  Raw features -  Statistics – mean, median, variance, etc. -  Geometry – distance to closest data point, furthest data point, etc. -  ML algorithm outputs Usually two such functions are computed on the dataset. M x 1 M x 1
  • 14. M x M distance matrix Mapper: 3. Apply cover & overlap M x N M x 1 M x 1 … Next, the ranges of each filter application are “chopped up” into overlapping segments or intervals using two parameters: cover and overlap -  Cover (aka resolution) controls how many intervals each filter range will be chopped into, e.g. 40,100 -  Overlap controls the degree of overlap between intervals (e.g. 20%) Cover Overlap
  • 15. M x M distance matrix Mapper: 4. Compute Cartesians M x N The next step is to compute the Cartesian products of the range intervals (from the previous step) and assign the original data points to the resulting two-dimensional regions based on their filter values. Note that these two-dimensional regions will overlap due to the parameters set in the previous step. In other words, there will be points in common between these regions. M x 2 …
  • 16. M x M distance matrix Mapper: 5. Perform clustering M x N The penultimate stage in the Mapper algorithm is to perform clustering in the original high- dimensional space for each (overlapping) region. Each cluster will be represented by a node; since regions overlap, some clusters will have points in common. Their corresponding nodes will be connected via an unweighted edge. The kind of clustering performed is immaterial. Our implementation uses DBSCAN. M x 2 … . . . .. . .. . . .. .
  • 17. Mapper: 6. Build TDA network Finally, by joining nodes in topological space (re: clusters in feature space) that have points in common, one can derive a topological network in the form of a graph. Graph coloring can be performed to capture localized behavior in the dataset and derive hidden insights from the data.
  • 18. Open source Mapper implementations •  Python: –  Python Mapper, Mullner and Babu: http://danifold.net/mapper/ –  Proof-of-concept Mapper in a Kaggle notebook, @mlwave: https://www.kaggle.com/triskelion/digit-recognizer/mapping-digits-with-a-t-sne- lens/notebook •  R: –  TDAmapper package •  Matlab: –  Original mapper implementation
  • 20. Mapper: Computationally expensive! M x M distance matrix M x N Mx1 Mx1 . . . .. . . . . . .. . … O(N2) is prohibitive for large datasets Single-node open source Mappers choke on large datasets (generously defined as > 10k data points with >100 columns)
  • 21. Rolling our own Mapper.. •  Our Mapper implementation –  Built on PySpark 1.6.1 –  Called Betti Mapper –  Named after Enrico Betti, a famous topologist X ✔
  • 22. Multiple ways to scale Mapper 1.  Naïve Spark implementation ü  Write the Mapper algorithm using (Py)Spark RDDs –  Distance matrix computation still performed over entire dataset on driver node 2.  Down-sampling / landmarking (+ Naïve Spark) ü  Obtain manageable number of samples from dataset –  Unreasonable to assume global distribution profiles are captured by samples 3.  LSH Prototyping!!!!?!
  • 23. What came first? •  We use Mapper to detect structure in high-dimensional data using the concept of similarity. •  BUT we need to measure similarity so we can sample efficiently. •  We could use stratified sampling, but then what about •  Unlabeled data? •  Anomalies and outliers? •  LSH is a lower-cost first pass capturing similarity for cheap and helping to scale Mapper
  • 24. Locality sensitive hashing by random projection •  We draw random vectors with same dimensions as dataset and compute dot products with each data point •  If dot product > 0, mark as 1, else 0 •  Random vectors serve to slice feature space into bins •  Series of projection bits can be converted into a single hash number •  We have found good results by setting # of random vectors to: floor(log2 |M|) 1 1 1 0 0 0 … …
  • 25. Scaling with LSH Prototyping on Spark 1.  Use Locality Sensitive Hashing (SimHash / Random Projection) to drop data points into bins 2.  Compute “prototype” points for each bin corresponding to bin centroid –  can also use median to make prototyping more robust 3.  Use binning information to compute topological network: distMxM => distBxB, where B is no. of prototype points (1 per bin) ü Fastest scalable implementation ü # of random vectors controls # of bins and therefore fidelity of topological representation ü LSH binning tends to select similar points (inter-bin distance > intra-bin distance)
  • 26. Betti Mapper B x B Distance Matrix of prototypes M x N Bx1 Bx1 . . . .. . . . . . .. . … LSH M x M Distance Matrix: D(p1, p2) = D(bin(p1), bin(p2)) Mx1 Mx1 B x N prototypes X X
  • 28. Using pyspark •  Simple to “sparkify” an existing python mapper implementation •  Leverage the rich python ML support to greatest extent –  Modify only the computational bottlenecks •  Numpy/Scipy is essential •  Turnkey Anaconda deployment on CDH
  • 29. Naïve performanceTime(Years) •  4TFLOP/s GPGPU (100% util) •  5K Columns •  Euclidean distance Seconds Decades Row count 0.000 0.000 0.000 0.000 0.000 0.001 0.010 0.100 1.000 10.000 100.000 1000.000 10K 100K 1M 10M 100M 1B
  • 30. Our Approach Build and test three implementations of Mapper 1.  Naïve Mapper on Spark 2.  Mapper on Spark with sampling (5%, 10%, 25%) 3.  Betti Mapper: LSH + Mapper (8v, 12v, 16v)
  • 31. Test Hardware Macbook Pro, mid 2014 •  2.5 GHz Intel® Core i7 •  16 GB 1600 MHz DDR3 •  512 GB SSD Spark Cluster on Amazon EC2 •  Instance type: r3.large •  Node: 2 vCPU, 15 GB RAM, 32 GB SSD •  4 workers, 1 driver •  250 GB SSD EBS as persistent HDFS •  Amazon Linux, Anaconda 64-bit 4.0.0, PySpark 1.6.1
  • 32. Spark Configuration •  --driver-memory 8g •  --executor-memory 12g (each) •  --executor-cores 2 •  No. of executors: 4
  • 33. Dataset Configuration Filename Size (MxN) Size (bytes) MNIST_1k.csv 1000 rows x 784 cols 1.83 MB MNIST_10k.csv 10,000 rows x 784 cols 18.3 MB MNIST_100k.csv 100,000 rows x 784 cols 183 MB MNIST_1000k.csv 1,000,000 rows x 784 cols 1830 MB The datasets are sampled with replacement from the original MNIST dataset available for download using Python’s scikit-learn library (mldata module)
  • 34. Test Harness •  Runs test cases on cluster •  Test case: –  <mapper type, dataset size, no. of vectors> •  Terminates when runtime exceeds 40 minutes
  • 35. Some DAG Snapshots Graph coloring by median digitClustering and node assignment
  • 36. X X X 40minutes
  • 37.
  • 38. Future Work •  Test other LSH schemes •  Optimize Spark code and leverage existing codebases for distributed linear algebra routines •  Incorporate as a machine learning model on the Alpine Data platform
  • 40. Key Takeaways •  Scaling Mapper algorithm is non-trivial but possible •  Gaining control over fidelity of representation is key to gaining insights from data •  Open source implementation of Betti Mapper will be made available after code cleanup! J
  • 41. References •  “Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition”, G. Singh, F. Memoli, G. Carlsson, Eurographics Symposium on Point-Based Graphics (2007) •  “Extracting insights from the shape of complex data using topology”, P. Y. Lum, G. Singh, A. Lehman, T. Ishkanov, M. Vejdemo-Johansson, M. Alagappan, J. Carlsson, G. Carlsson, Nature Scientific Reports (2013) •  “Online generation of locality sensitive hash signatures”, B. V. Durme, A. Lall, Proceedings of the Association of Computational Linguistics 2010 Conference Short Papers (2010) •  PySpark documentation: http://spark.apache.org/docs/latest/api/python/
  • 43. Alpine is Hiring •  Platform engineers •  UX engineers •  Build engineers •  Ping me : lawrence@alpinenow.com