Big Data @ LIST
- Overview -
Dr. Francesco Bongiovanni
Credentials
Dr. Francesco Bongiovanni
B.Sc. in Computer Systems from Haute Ecole de Namur (Belgium)
ECTS certificate from Kemi-Tornio Univ. of Applied Sciences (Finland)
M.Sc. in Software Engineering of Distributed Systems, KTH - Royal Institute of Technology (Sweden)
Ph.D. in Computer Science from INRIA Sophia-Antipolis and Univ. of Nice Sophia Antipolis (France) - Oasis team (now Scale)
Post-Doc @ INRIA Grenoble and Joseph Fourier University (UJF) - Erods team (formerly Sardes)
Post-Doc @ Verimag Laboratory (CNRS) - Synchrone team
Expertise
● Scalable distributed systems & algorithms
○ P2P systems (Structured/Unstructured)
● Cloud computing
● Applied formal methods (Isabelle Theorem Prover / TLA+)
● Scalable simulations
● Distributed optimizations
● Cluster computing
francesco.bongiovanni@list.lu (Scholar / homepage / LinkedIn)
Foreword and disclaimer
● The following frameworks, tools, etc. are all open-source projects (a big chunk comes from the Apache Foundation).
● This presentation is just a glimpse of what is available as of today.
● Full-blown details (configurations, setups, debugging bindings/frameworks, advanced examples... which took months to get right) are intentionally omitted.
As a user of this software stack, you are mainly concerned about two things:
1. Programmability: does this stack provide me with the necessary programming abstractions/tools for expressing my problems?
2. Fault tolerance: if a task/process/computer fails while I am running something, does that affect my computation(s)?
→ Lots of abstractions (for computing on big graphs, running iterative algorithms on large datasets, ...) are provided through various distributed frameworks.
→ Fault tolerance is baked into most of the presented frameworks.
Scientific processing
Scientists want to spend their time exploring, not coding nor waiting for their computations to be done.
IDEA → Write code → Run code → Study results → Publish paper
(the write/run loop is unproductive time, and the code is often never run again)
The free lunch is over... and has been since 2005!
● The number of cores is now 2-12 *
● Moore’s law continues
○ the number of transistors doubles every 18 months
Holy Grail: develop software whose performance scales with the number of cores
=> Parallelize or perish...
* Intel announced on 16/02/2015 that an 18-core Xeon chip would be available before this summer
The state of things to come...
by adapteva
Modern hardware layout
A technical perspective: scale and deep analysis
Emerging platforms
Software Infrastructure Overview
eScience Cluster (for prototyping purposes)
Mission
● Provide the team with Big Data programming capabilities
● Program the cluster as if it were one (big) computer
○ reliable & scalable programming abstractions
● Co-locate various distributed frameworks efficiently
Hardware specs:
● 3 × 2 quad-core E5440 @ 2.83 GHz, 40 GB RAM
● 1 × 2 quad-core E5550 @ 2.67 GHz, 48 GB RAM
=> Total: 40 CPUs, 168 GB RAM, +/- 1 TB storage, Gigabit LAN
Apache Mesos
● Resource-efficient scheduler
● Reliable & scalable
● Many frameworks are powered by Mesos (Hadoop, Chronos, Marathon, Storm, Spark, Aurora, Jenkins, …)
Programming capabilities:
● Streaming: able to process data streams (news feeds, Twitter feeds, ...)
● Iterative: able to program iterative algorithms (PageRank, K-means, ...)
● Interactive: able to interactively query large volumes of data
● …
Detailed overview
(Stack diagram, as of Oct. 2015: HDFS and Mesos at the base; Spark with Spark Streaming (real-time), Spark SQL (SQL), GraphX (graph) and MLlib (machine learning); Cray Chapel; ZooKeeper; SparkR; H2O; ...)
HDFS, v 1.2.1
● Hadoop Distributed File System -> a scalable, reliable file system through replication
○ note: a file is immutable once stored on HDFS
● +/- 1 TB of available storage for our datasets (wiki dumps, geographical datasets, ...)
● Used by Spark and other frameworks to store/retrieve datasets (see the sketch below)
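For example, a minimal sketch of the typical read/write pattern from Spark (the file names are hypothetical; the namenode address is the one used later in this deck):

// read a dataset stored on HDFS into an RDD
val lines = sc.textFile("hdfs://10.10.0.141:9000/datasets/sample.txt")
// HDFS files are immutable, so results are written out as a new directory of part files
lines.filter(_.nonEmpty).saveAsTextFile("hdfs://10.10.0.141:9000/datasets/sample-nonempty")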
Mesos, v 0.25.0
● Distributed OS scheduler & task executor
● Leverages Linux cgroups; no virtualization required
● Implements the Dominant Resource Fairness (DRF) algorithm for fair task allocation (toy sketch below)
● Lets you see the data center as one big computer
● Uses Apache ZooKeeper for master fail-over (distributed coordination)
● !! A common resource layer over which diverse frameworks can run !!
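To illustrate DRF (a toy sketch of the published algorithm's core rule, not actual Mesos code; the framework names and usage numbers are hypothetical):

// per-framework resource usage on the cluster
case class Usage(cpus: Double, memGB: Double)
val total = Usage(40, 168) // the eScience cluster totals

// a framework's dominant share is its largest fraction of any single resource
def dominantShare(u: Usage): Double =
  math.max(u.cpus / total.cpus, u.memGB / total.memGB)

// DRF offers resources next to the framework with the smallest dominant share
val usage = Map("spark" -> Usage(8, 32), "chronos" -> Usage(2, 4))
val next = usage.minBy { case (_, u) => dominantShare(u) }._1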
Spark, v 1.5.1 *
● Fast and general engine for large-scale data processing
● Leverages in-memory computing
● Apps can be written in Java / Python / Scala
● Combines SQL, streaming and complex analytics
● Can read from HDFS, Cassandra, HBase, S3, …
● Integrates with YARN and Mesos
* discovered a blocking bug (MAJOR) in the previous release (SPARK-1052) and helped the community narrow it down
Spark embeds specialized frameworks:
● Spark Streaming: leverages Spark to perform computations on streaming data (Twitter feeds, raw network greps, …)
● Spark SQL: large-scale data warehouse system for Spark that can execute HiveQL queries up to 100x faster than Hive
● GraphX: library for large-scale graph processing; PageRank, for instance, can easily be implemented using GraphX (< 40 LoC); see the sketch after this list
● MLlib: Spark implementation of some common machine learning (ML) functionality (linear regression, binary classification, clustering, ...)
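A hedged sketch of PageRank using GraphX's built-in operator (the HDFS path and edge-list file are assumptions):

import org.apache.spark.graphx.GraphLoader

// load a graph from an edge-list file (one "srcId dstId" pair per line)
val graph = GraphLoader.edgeListFile(sc, "hdfs://10.10.0.141:9000/datasets/edges.txt")
// run PageRank until the ranks converge within the given tolerance
val ranks = graph.pageRank(0.0001).vertices
// show the ten highest-ranked vertices
ranks.sortBy(-_._2).take(10).foreach(println)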
SparkR, v 0.1
- R front-end for Spark (linked against Spark 1.1.0)
- Leverages the Spark engine for distributed/parallel computations
- Allows R functions to run against larger datasets
As of Spark 1.4, SparkR has been integrated into Spark.
Chapel, v 1.9 (http://chapel.cray.com/index.html)
- Emerging open-source parallel programming language from Cray Inc.
“Chapel's primary goal is to make parallel programming far more productive, from multicore desktops and laptops to commodity clusters and the cloud to high-end supercomputers”
- Thanks to (working*) bindings in Go, Chapel programs can run on top of Mesos
- Partitioned Global Address Space (PGAS) model, i.e. threads/processes/tasks share one big address space across nodes
* Found buggy bindings online and managed to make them work (https://github.com/francesco-bongiovanni/mesos-chapel)
H2O (http://0xdata.com/product/sparkling-water/)
- Extended machine learning
- Deep learning
Other frameworks can be built and run on top of Mesos:
Aurora & Marathon: Mesos frameworks for long-running apps (Unix equivalent of “init.d”)
Exelixi: distributed framework for running genetic algorithms at scale
Chronos: distributed fault-tolerant scheduler supporting complex job topologies (Unix equivalent of “cron”)
Storm: distributed real-time computation system (similar to Spark Streaming)
Kafka: high-throughput distributed messaging system
MPI: message-passing system
...port your own…
Distributed frameworks can be written in JVM-based languages (Scala, Clojure, Java, …), Python, C++, Go, Erlang
(Programming) Tools
Available Tools and PLs
- an HDFS cluster (with fault tolerance and replication enabled) to store datasets
- applications using Spark can be programmed in Java, Scala or Python
- for building distributed frameworks on top of Mesos:
- Mesos officially supports C, C++, Java/Scala and Python
- third-party bindings exist in Go, Erlang, Haskell, ...
- IPython Notebook working with Spark (http://ipython.org/notebook.html)
- a Jupyter notebook with a Spark kernel is also available (http://www.jupyter.org)
- RStudio bundled with the SparkR package
- an RStudio Server is also installed on the cluster
- compilers/VMs within the cluster (and on the coming VM):
- Go 1.3.1, R 3.1.1, GCC 4.8.2 (C++11 fully supported), JDK 1.7.0_65, Scala 2.10.3, Python 2.7.6, Chapel 1.9 with the GASNet networking library (high performance), Erlang 17.3
Some Examples
Class of problem | Use case | Before | After | Improvement
Statistics | Averaging 1 billion double values | 15-20 minutes | 10 seconds | 100x speedup
Visualization algorithm | Weighted Maps algorithm | 36K elements | 2.4M elements | 66.6x more elements
Clustering algorithm | Pearson correlation coefficient on a full matrix (31K rows x 10 cols): 991M tuples | NA | 6.9 min | makes it possible
Clustering algorithm | K-means on forest data type (508K rows x 55 cols) | NA | 43.5 secs (average) | makes it possible
Clustering algorithm | Co-abundance clustering on 31K rows x 7 cols | 5 minutes | 20 seconds | 15x speedup
Results from the trenches
Example #1 - Averaging 1 billion elements
Problem:
● a 17 GB text file with 10^9 double values
GOAL: compute the average of these elements
Steps: parseToDouble → sumAllElements → divideBy10^9
Some numbers people should be aware of!
Example #1 - Averaging 1 billion elements
Back-of-the-envelope calculations: reading a 17 GB file on a single computer
reading 1 MB -> 30 ms
reading 17 GB -> 30 ms × 17,000 = 510,000 ms = 510 secs = 8 min 30 secs
Example #1 - Averaging 1 billion elements
Back-of-the-envelope calculations: reading a 17 GB file from a 4-node HDFS cluster
reading 1 MB from the network -> 10 ms
reading 1 MB from local disk -> 30 ms
reading 17 GB from the cluster -> 40 ms × (17,000 / 4) = 170,000 ms = 2 min 50 secs
Example #1 - Averaging 1 billion elements
Back-of-the-envelope calculations
Reading a 17 GB file from 1 node: 8 min 30 secs
Reading a 17 GB file from a 4-node cluster: 2 min 50 secs
That’s almost a 3x speed improvement...
Example #1 - Averaging 1 billion elements
Back-of-the-envelope calculations: reading a 17 GB file from a 4-node HDFS cluster IN MEMORY
reading 1 MB from the network -> 10 ms
reading 1 MB from local memory -> 250 µs = 0.250 ms
reading 17 GB from the cluster -> 10.25 ms × (17,000 / 4) = 43,562 ms = 43.5 secs
Example #1 - Averaging 1 billion elements
Back-of-the-envelope calculations
Reading a 17 GB file from 1 node: 8 min 30 secs
Reading a 17 GB file from a 4-node cluster IN MEMORY: 43.5 secs
That’s almost a 12x speed improvement...
Example #1 - Averaging 10^9 elements
On a single machine: 15-20 min to read/parse/average, and again for each further op (avg, sum of squares, ...)
Using Spark on the cluster: the 1st iteration reads, parses, persists in memory and averages; following iterations take < 10 secs for any similar operation
Example #1 - Averaging 10^9 elements
// read the file from HDFS
val rawDistData = sc.textFile("hdfs://10.10.0.141:9000/datasets/1B.txt")
// convert the values to Double
val distData: org.apache.spark.rdd.RDD[Double] = rawDistData.map(_.toDouble)
// put the data in cache, i.e. in memory
distData.cache()
// sum all elements and average
distData.reduce(_ + _) / 1000000000
// average of the sum of squares
distData.map(x => x * x).reduce(_ + _) / 1000000000
...
Example #1 - Averaging 1 billion elements
17 GB → read local chunks from HDFS → put the data in memory → compute the avg → send the results back → aggregate the result
Example #1 - Averaging 10^9 elements
Example #2 - WeightedMaps
Weighted Maps: treemap visualization of geolocated quantitative data (by WP4 colleagues)
● French communes with 36K elements
○ Does it scale to a larger number of elements?
Example #2 - WeightedMaps
● Tested with 2 million elements
○ beyond 2×10^6 elements, bottlenecks occur in the spatial division (but may be due to lack of memory) -> under investigation
● No modifications to the WM source implementation
● `just` branch it with Spark
○ minor modifications for data-retrieval convenience
○ most of the time was spent tweaking Spark (data serialization, memory consumption, GC heap space, ...)
note: the data itself makes no sense (a union of French communes and US counties, ...)
Example #2 - WeightedMaps
● Tested with 2.4 million elements
○ beyond 2.4×10^6 elements, bottlenecks occur in the spatial division (but may be due to lack of memory) -> under investigation
● No modifications to the WM source implementation
● `just` branch it with Spark
○ minor modifications for data-retrieval convenience
○ most of the time was spent tweaking Spark (data serialization, memory consumption, GC heap space, ...)
note: the data itself makes no sense (a union of French communes and US counties, ...)
Example #2 - WeightedMaps
● < 150 LoC: from parsing to painting!
○ time perf for 2.4×10^6 elements: +/- 145 secs
● < 18 secs is spent on the Spark side (no optimization whatsoever)
Next steps in the pipe:
● Mine the GeoNames dataset with Spark
○ extract meaningful data
● Spark implementation of WeightedMap
○ scale WM to the next level
○ rewrite the WM source implementation to leverage Spark's data structure (RDDs)
Example #3 - Computing Pearson’s correlation coefficient on EVA contigs
Data: EVA contigs, 31K rows by 10 columns
Pearson correlation computed on a full matrix, i.e., 31K * 31K = 991,179,289 vector tuples (991 million)
A `naive` implementation took 6.9 minutes
Leverages the ScalaNLP library (a plain-Scala sketch follows)
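A minimal sketch of such a naive all-pairs computation in plain Scala/Spark (the actual implementation relies on ScalaNLP; the row layout is an assumption):

import org.apache.spark.rdd.RDD

// Pearson correlation between two equal-length abundance profiles
def pearson(x: Array[Double], y: Array[Double]): Double = {
  val n = x.length
  val mx = x.sum / n
  val my = y.sum / n
  val cov = (x zip y).map { case (a, b) => (a - mx) * (b - my) }.sum
  val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
  val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
  cov / (sx * sy)
}

// all-pairs correlations over the full matrix: 31K x 31K row combinations
def allPairs(rows: RDD[(Long, Array[Double])]): RDD[((Long, Long), Double)] =
  rows.cartesian(rows).map { case ((i, x), (j, y)) => ((i, j), pearson(x, y)) }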
Example #4 - K-means clustering on Spark
Data: forest dataset (Covertype data set), 581,012 rows by 55 columns
Parsing the raw lines into proper Vectors using Scala/Spark (a sketch follows)
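A minimal sketch of that parsing step, assuming the Covertype CSV layout (54 integer features followed by the cover-type label) and a hypothetical HDFS path:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// each line holds 55 comma-separated integers: 54 features + 1 cover-type label
val raw = sc.textFile("hdfs://10.10.0.141:9000/datasets/covtype.data")
val data: org.apache.spark.rdd.RDD[Vector] = raw.map { line =>
  val values = line.split(',').map(_.toDouble)
  Vectors.dense(values.init) // drop the trailing label, keep the 54 features
}.cache()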
Example #4 - K-means clustering on Spark
Data: forest dataset (Covertype data set), 581,012 rows by 54 feature columns
K = 7, max iterations = 100
Mean running time: +/- 43.65 secs
Implementation based on the k-means|| algorithm by Bahmani et al. (VLDB 2012)
→ When multiple concurrent runs are requested, they are executed together with joint passes over the data for efficiency
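With the parsed RDD from the previous sketch, the MLlib call is roughly the following (MLlib's default initialization mode is k-means||):

import org.apache.spark.mllib.clustering.KMeans

// cluster the 54-dimensional points into K = 7 clusters, at most 100 iterations
val model = KMeans.train(data, 7, 100)
// cost: sum of squared distances from each point to its nearest center
println(model.computeCost(data))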
Ongoing work - Co-abundance clustering
Formal specs → Spark-based implementation
Hydviga project: biogas production, contig binning
Visual analytics-based tool (parallel coordinates)
Data clustering algorithm (in R): great for prototyping, not so great for handling large data sets... Let’s rewrite it...
Ongoing work - Co-abundance clustering
Using TLA+ / PlusCal:
- simulation
- model checking
=> helps you think above the code level
=> think of it as executable pseudo-code which can be checked
A kind reminder
● I am NOT a distributed/parallel programming fanatic
● YOU can do a lot with modern multicore machines, provided:
○ you know your way around your algorithms and concurrent programming
○ you have a good idea of how your machine works
=> Don’t come to me
if the data is too small
if the data fits in your computer and it’s a one-time thing
=> Come to me
if the data does not fit into your memory / disk
if your computation(s) is/are really expensive and frequent
However, even for small data, you can benefit from using Spark locally on your machine
- behind the scenes there is the Actor model of computation
- so you can use it for `simple` parallel programming (see the sketch below)
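A minimal sketch of that local usage (the app name and computation are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

// run Spark on a single machine, with one worker thread per core
val conf = new SparkConf().setAppName("local-example").setMaster("local[*]")
val sc = new SparkContext(conf)
// a simple data-parallel computation spread across all cores
val sumOfSquares = sc.parallelize(1L to 1000000L).map(x => x * x).reduce(_ + _)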
(Abstraction stack: Problem → Algorithm → Program → Instruction Set Architecture (ISA) → Microarchitecture → Circuits → Electrons)
Some References
Foundational paper about Spark
Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI 12). USENIX Association, 2012.
Foundational paper about Spark Streaming
Zaharia, Matei, et al. "Discretized streams: Fault-tolerant streaming computation at scale." Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP 13). ACM, 2013.
Paper about Chapel
Chamberlain, Bradford L., David Callahan, and Hans P. Zima. "Parallel programmability and the Chapel language." International Journal of High Performance Computing Applications 21.3 (2007): 291-312.
Paper about GraphX
Gonzalez, Joseph E., et al. "GraphX: Graph processing in a distributed dataflow framework." Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). USENIX Association, 2014.
Paper about Spark SQL
Xin, Reynold S., et al. "Shark: SQL and rich analytics at scale." Proceedings of the 2013 International Conference on Management of Data (SIGMOD 13). ACM, 2013.
http://spark.apache.org
http://mesos.apache.org
http://chapel.cray.com
