Dr. Francesco Bongiovanni has expertise in scalable distributed systems and algorithms, cloud computing, applied formal methods, and distributed optimization. He holds a B.Sc. in Computer Systems, an M.Sc. in Software Engineering of Distributed Systems, and a Ph.D. in Computer Science, and has worked at INRIA and the Verimag Laboratory. This presentation provides an overview of big data frameworks and tools, including HDFS, Mesos, Spark, Spark Streaming, Spark SQL, GraphX, MLlib, Chapel, ZooKeeper, and SparkR, that can be run on the eScience cluster to process large datasets in a scalable, fault-tolerant manner. Examples demonstrate operations such as averaging 1 billion elements.
With Anaconda (in particular Numba and Dask) you can scale up your NumPy and Pandas stack to many CPUs and GPUs, as well as scale out to run on clusters of machines, including Hadoop.
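The scale-up idea behind Dask can be sketched in plain Python: split the data into chunks, reduce each chunk independently (potentially on separate cores), then combine the small partial results. This is an illustrative stdlib sketch of that chunked-computation pattern, not Dask's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_stats(chunk):
    """Per-chunk partial result: (sum, count)."""
    return (sum(chunk), len(chunk))

def parallel_mean(data, chunk_size=4):
    """Mean of `data` computed chunk by chunk, Dask-style."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(chunk_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

print(parallel_mean(list(range(10))))  # 4.5
```

Dask generalizes exactly this decomposition to whole NumPy/Pandas workloads and schedules the chunk tasks across threads, processes, or a cluster.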
Open Source Lambda Architecture for Deep Learning (Patrick Nicolas)
This presentation describes the layers and open source components that can be used to design and implement a lambda architecture supporting batch processing for model training and stream processing for prediction.
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ... (Databricks)
Explore the trade-offs of performing linear algebra for data analysis and machine learning using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Apache Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks.
This session will examine three widely-used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). Learn how these methods are applied to terabyte-sized problems in particle physics, climate modeling and bioimaging, use cases where interpretable analytics is of interest. The data matrices are tall-and-skinny, a shape that enables the algorithms to map conveniently onto Spark’s data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance. Based on joint work with Alex Gittens and many others.
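Why the tall-and-skinny shape maps so well onto data-parallel engines: the Gram matrix A^T A is only n x n (n = number of columns), so each worker can accumulate outer products of its rows and only the small partial sums need to be combined; PCA and related factorizations then run on that small result. A hedged pure-Python sketch of the row-streaming accumulation (the talk itself uses Spark, and real code would use optimized linear algebra):

```python
def gram_update(acc, row):
    """Add the outer product row^T * row into the n x n accumulator."""
    n = len(row)
    for i in range(n):
        for j in range(n):
            acc[i][j] += row[i] * row[j]
    return acc

def tall_skinny_gram(rows):
    """Compute A^T A by streaming over rows, as a Spark aggregate would."""
    n = len(rows[0])
    acc = [[0.0] * n for _ in range(n)]
    for row in rows:  # in Spark: a per-partition fold followed by a tree reduce
        gram_update(acc, row)
    return acc

A = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
print(tall_skinny_gram(A))  # [[35.0, 44.0], [44.0, 56.0]]
```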
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ... (Databricks)
Apache Spark performance on SQL and DataFrame/DataSet workloads has made impressive progress, thanks to Catalyst and Tungsten, but there is still a significant gap towards what is achievable by best-of-breed query engines or hand-written low-level C code on modern server-class hardware. This session presents Flare, a new experimental back-end for Spark SQL that yields significant speed-ups by compiling Catalyst query plans to native code.
Flare’s low-level implementation takes full advantage of native execution, using techniques such as NUMA-aware scheduling and data layouts to leverage ‘mechanical sympathy’ and bring execution closer to the metal than current JVM-based techniques on big memory machines. Thus, with available memory increasingly in the TB range, Flare makes scale-up on server-class hardware an interesting alternative to scaling out across a cluster, especially in terms of data center costs. This session will describe the design of Flare, and will demonstrate experiments on standard SQL benchmarks that exhibit order of magnitude speedups over Spark 2.1.
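Flare's core idea, compiling a query plan into one specialized piece of code instead of interpreting it operator by operator, can be illustrated in miniature. The sketch below "compiles" a filter-plus-projection plan into a single fused Python function; Flare itself emits native code from Catalyst plans, and all names here are illustrative:

```python
def compile_plan(filter_expr, project_cols):
    """Generate one fused loop for a filter + projection plan."""
    src = (
        "def query(rows):\n"
        "    out = []\n"
        "    for row in rows:\n"
        f"        if {filter_expr}:\n"
        f"            out.append(({', '.join(project_cols)},))\n"
        "    return out\n"
    )
    namespace = {}
    exec(src, namespace)  # stand-in for emitting and loading native code
    return namespace["query"]

rows = [{"name": "ann", "age": 42}, {"name": "bob", "age": 25}]
q = compile_plan("row['age'] > 30", ["row['name']"])
print(q(rows))  # [('ann',)]
```

The generated function has no per-operator dispatch overhead, which is the same reason compiled query engines beat interpreted ones on modern hardware.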
Making NumPy-style and Pandas-style code faster and run in parallel. Continuum has been working on scaled versions of NumPy and Pandas for 4 years. This talk describes how Numba and Dask provide scaled Python today.
Accelerating Data Analysis of Brain Tissue Simulations with Apache Spark with... (Databricks)
In recent years, increasing computational power has made possible larger scientific experiments with high computing demands, such as brain tissue simulations. Larger simulations generally produce larger amounts of data that then need to be analyzed by neuroscientists. Currently, neuroscientists analyze simulation reports with Python scripts, thanks to Python's programming simplicity and the performance of the NumPy library.
However, this analysis workflow will become unfeasible in the near future, as we foresee a 10x increase of the dataset size in the next year. Therefore, we are exploring how to accelerate data analysis of brain activity simulations with big data technologies, like Spark. In this talk, we will present how we address this challenge: from building RDDs/DataFrames from custom binary files to data queries and transformations to achieve the desired scientific analyses. In order to reach our goals, we have implemented our workflow in five different ways, combining RDDs, DataFrames, different data structures and representations and different data partitioning.
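Building records from a custom binary format, the first step described above before handing data to RDDs or DataFrames, typically comes down to fixed-size record parsing. A hedged stdlib sketch; the record layout here (a uint32 id plus a float64 value) is invented for illustration, not the speakers' actual format:

```python
import struct

# Hypothetical layout: one uint32 neuron id + one float64 voltage per record.
RECORD = struct.Struct("<Id")

def read_records(blob):
    """Yield (neuron_id, voltage) tuples from a packed binary blob."""
    for offset in range(0, len(blob), RECORD.size):
        yield RECORD.unpack_from(blob, offset)

blob = RECORD.pack(1, -65.0) + RECORD.pack(2, -70.5)
print(list(read_records(blob)))  # [(1, -65.0), (2, -70.5)]
```

In a Spark job, a function like `read_records` would run inside `mapPartitions` over byte ranges of the file, producing the rows of an RDD or DataFrame.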
After significant engineering and programming efforts, we would like to share with the community our lessons learned: how Spark features can support data analysis in our neuroscience research area and what types of decisions can negatively impact performance. Moreover, we would also like to open a discussion on some critical limitations we have found in Spark applied to our use cases, and how to address them in the future as a joint community effort. In brief, as takeaway messages, we will highlight the suitability of Spark for our data analysis, how data generation can strongly affect subsequent data analysis, and how the choice of data types and formats can have a significant impact on Spark performance. We will present our experiments run on Cooley, the Argonne National Laboratory (ANL) data analysis cluster.
Lightning fast genomics with Spark, ADAM and Scala (Andy Petrella)
We are at a time when biotechnology allows us to get personal genomes for $1000. Tremendous progress has been made in DNA sequencing since the 70s, e.g. more samples per experiment and higher genomic coverage at higher speeds. The genomic analysis standards developed over the years weren't designed with scalability and adaptability in mind. In this talk, we’ll present a game-changing technology in this area: ADAM, initiated by the AMPLab at Berkeley. ADAM is a framework based on Apache Spark and Parquet storage. We’ll see how it can speed up sequence reconstruction by a factor of 150.
TensorFrames: Google TensorFlow on Apache Spark (Databricks)
Presentation at Bay Area Spark Meetup by Databricks Software Engineer and Spark committer Tim Hunter.
This presentation covers how you can use TensorFrames with TensorFlow to perform distributed computing on GPUs.
A lecture given for Stats 285 at Stanford on October 30, 2017. I discuss how OSS technology developed at Anaconda, Inc. has helped to scale Python to GPUs and Clusters.
Using Anaconda to light up dark data. My talk given to the Berkeley Institute of Data Science describing Anaconda and the Blaze ecosystem for bringing a virtual analytical database to your data.
Deep learning beyond the learning - Jörg Schad - Codemotion Rome 2018 (Codemotion)
Open source frameworks such as TensorFlow, MXNet, or PyTorch enable anyone to model and train deep neural networks. While there are many great tutorials and talks showing the best ways to train models, there is little information on what happens after we have trained our model: how can we store, utilize, and update it? In this talk, we look at the complete deep learning pipeline and cover topics such as deployment, multi-tenancy, Jupyter notebooks, model serving, and more.
Convolutional Neural Networks at Scale in Spark MLlib (DataWorks Summit)
Jeremy Nixon will focus on the engineering and applications of a new algorithm built on top of MLlib. The presentation will focus on the methods the algorithm uses to automatically generate features to capture nonlinear structure in data, as well as the process by which it’s trained. Major aspects of that are the compositional transformations over the data, convolution, and distributed backpropagation via SGD with adaptive gradients and an adaptive learning rate. Applications will look into how to use convolutional neural networks to model data in computer vision, natural language and signal processing. Details around optimal preprocessing, the type of structure that can be learned, and managing its ability to generalize will inform developers looking to apply nonlinear modeling tools to problems that they face.
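The "SGD with adaptive gradients and an adaptive learning rate" mentioned above is, in its simplest form, an AdaGrad-style rule: each parameter's step is scaled down by the history of its own squared gradients, so frequently-updated coordinates slow down while rare ones keep moving. A minimal sketch (the talk's distributed version aggregates gradients across partitions before applying this update):

```python
import math

def adagrad_step(params, grads, cache, lr=0.1, eps=1e-8):
    """One AdaGrad update: per-parameter learning rate decays with
    the accumulated squared gradient history stored in `cache`."""
    for i, g in enumerate(grads):
        cache[i] += g * g
        params[i] -= lr * g / (math.sqrt(cache[i]) + eps)
    return params

params = [1.0, 1.0]
cache = [0.0, 0.0]
adagrad_step(params, [0.5, 2.0], cache)
# On the very first step each coordinate moves by ~lr regardless of
# gradient magnitude, since the step normalizes by |g| itself.
print(params)
```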
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal (Srivatsan Ramanujam)
These slides give an overview of the technology and the tools used by Data Scientists at Pivotal Data Labs. This includes Procedural Languages like PL/Python, PL/R, PL/Java, PL/Perl and the parallel, in-database machine learning library MADlib. The slides also highlight the power and flexibility of the Pivotal platform from embracing open source libraries in Python, R or Java to using new computing paradigms such as Spark on Pivotal HD.
Ray (https://github.com/ray-project/ray) is a framework developed at UC Berkeley and maintained by Anyscale for building distributed AI applications. Over the last year, the broader machine learning ecosystem has been rapidly adopting Ray as the primary framework for distributed execution. In this talk, we will overview how libraries such as Horovod (https://horovod.ai/), XGBoost, and Hugging Face Transformers, have integrated with Ray. We will then showcase how Uber leverages Ray and these ecosystem integrations to simplify critical production workloads at Uber. This is a joint talk between Anyscale and Uber.
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer... (MLconf)
Spark and GraphX in the Netflix Recommender System: We at Netflix strive to deliver maximum enjoyment and entertainment to our millions of members across the world. We do so by having great content and by constantly innovating on our product. A key strategy to optimize both is to follow a data-driven method. Data allows us to find optimal approaches to applications such as content buying or our renowned personalization algorithms. But, in order to learn from this data, we need to be smart about the algorithms we use, how we apply them, and how we can scale them to our volume of data (over 50 million members and 5 billion hours streamed over three months). In this talk we describe how Spark and GraphX can be leveraged to address some of our scale challenges. In particular, we share insights and lessons learned on how to run large probabilistic clustering and graph diffusion algorithms on top of GraphX, making it possible to apply them at Netflix scale.
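Graph diffusion of the kind mentioned can be sketched as iterated neighbor averaging: each node's value blends with the mean of its neighbors' values. A toy pure-Python version of one diffusion step (GraphX would express the same computation with aggregateMessages over a partitioned edge list; the graph and mixing weight here are illustrative):

```python
def diffuse(values, edges, alpha=0.5):
    """One diffusion step: blend each node's value with its neighbors' mean."""
    neighbors = {v: [] for v in values}
    for a, b in edges:  # undirected edges
        neighbors[a].append(b)
        neighbors[b].append(a)
    new = {}
    for v, old in values.items():
        if neighbors[v]:
            mean = sum(values[u] for u in neighbors[v]) / len(neighbors[v])
            new[v] = (1 - alpha) * old + alpha * mean
        else:
            new[v] = old
    return new

values = {"a": 1.0, "b": 0.0, "c": 0.0}
print(diffuse(values, [("a", "b"), ("b", "c")]))
```

Iterating this step spreads mass from "a" through the chain; at Netflix scale the same per-edge message pattern runs over billions of edges.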
How Machine Learning and AI Can Support the Fight Against COVID-19 (Databricks)
In this session, we show how to leverage the CORD dataset, containing more than 400,000 scientific papers on COVID and related topics, and recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease.
The idea explored in our talk is to apply modern NLP methods, such as named entity recognition (NER) and relation extraction, to articles' abstracts (and, possibly, full text) to extract meaningful insights and to enable semantically rich search over the paper corpus. We first investigate how to train a NER model using the Medical NER dataset from Kaggle and a specialized version of BERT (PubMedBERT) as a feature extractor, allowing automatic extraction of entities such as medical condition names, medicine names, and pathogens. Entity extraction alone can provide some interesting findings, such as how approaches to COVID treatment evolved over time in terms of mentioned medicines. We demonstrate how to use Azure Machine Learning to train the model.
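The shape of the entity-extraction output can be shown with a toy dictionary matcher; this stands in for the PubMedBERT-based model purely for illustration, and the entity lists are invented:

```python
# Hypothetical gazetteers; a trained NER model replaces these lookups.
MEDICINES = {"remdesivir", "dexamethasone"}
PATHOGENS = {"sars-cov-2"}

def extract_entities(text):
    """Tag known medicine and pathogen mentions in an abstract."""
    entities = []
    for token in text.lower().replace(",", " ").split():
        if token in MEDICINES:
            entities.append((token, "MEDICINE"))
        elif token in PATHOGENS:
            entities.append((token, "PATHOGEN"))
    return entities

print(extract_entities("Remdesivir shows activity against SARS-CoV-2"))
# [('remdesivir', 'MEDICINE'), ('sars-cov-2', 'PATHOGEN')]
```

The per-paper entity lists produced this way are what get aggregated over time to study, e.g., which medicines dominate the COVID-treatment discussion in each month.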
To take this investigation one step further, we also investigate the use of pre-trained medical models, available as the Text Analytics for Health service on the Microsoft Azure cloud. In addition to many entity types, it can also extract relations (such as the dosage of a medicine), entity negation, and entity mappings to well-known medical ontologies. We investigate the best way to use Azure ML at scale to score a large paper collection and to store the results.
This contains the agenda of the Spark Meetup I organised in Bangalore on Friday, the 23rd of Jan 2014, along with the slides for the talk I gave on distributed deep learning over Spark.
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter (Databricks)
Deep learning has shown tremendous successes, yet it often requires a lot of effort to leverage its power. Existing deep learning frameworks require writing a lot of code to run a model, let alone in a distributed manner. Deep Learning Pipelines is a Spark Package library that makes practical deep learning simple, based on the Spark MLlib Pipelines API. Leveraging Spark, Deep Learning Pipelines scales out many compute-intensive deep learning tasks. In this talk we dive into: the various use cases of Deep Learning Pipelines, such as prediction at massive scale, transfer learning, and hyperparameter tuning, many of which can be done in just a few lines of code; how to work with complex data such as images in Spark and Deep Learning Pipelines; and how to deploy deep learning models through familiar Spark APIs such as MLlib and Spark SQL to empower everyone from machine learning practitioners to business analysts. Finally, we discuss integration with popular deep learning frameworks.
Snorkel: Dark Data and Machine Learning with Christopher Ré (Jen Aman)
Building applications that can read and analyze a wide variety of data may change the way we do science and make business decisions. However, building such applications is challenging: real world data is expressed in natural language, images, or other “dark” data formats which are fraught with imprecision and ambiguity and so are difficult for machines to understand. This talk will describe Snorkel, whose goal is to make routine Dark Data and other prediction tasks dramatically easier. At its core, Snorkel focuses on a key bottleneck in the development of machine learning systems: the lack of large training datasets. In Snorkel, a user implicitly creates large training sets by writing simple programs that label data, instead of performing manual feature engineering or tedious hand-labeling of individual data items. We’ll provide a set of tutorials that will allow folks to write Snorkel applications that use Spark.
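Snorkel's core move, replacing hand labels with many noisy labeling functions whose votes are combined into training labels, can be sketched like this. The real system learns an accuracy for each function rather than taking a plain majority vote, and the functions below are invented examples:

```python
ABSTAIN = None

def lf_keyword(text):
    """Vote spam (1) if a spam phrase appears, else abstain."""
    return 1 if "free money" in text else ABSTAIN

def lf_short(text):
    """Very short messages: weakly vote not-spam (0)."""
    return 0 if len(text) < 20 else ABSTAIN

def majority_label(text, lfs):
    """Combine labeling-function votes; Snorkel instead fits a label model."""
    votes = [lf(text) for lf in lfs]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

lfs = [lf_keyword, lf_short]
print(majority_label("win free money now!!", lfs))  # 1
print(majority_label("hi", lfs))                    # 0
```

Applied over a large unlabeled corpus (e.g. as a Spark map), this produces the probabilistic training set that a downstream model is then trained on.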
Snorkel is open source on GitHub and available from Snorkel.Stanford.edu.
Running Emerging AI Applications on Big Data Platforms with Ray on Apache Spark (Databricks)
With the rapid evolution of AI in recent years, we need to embrace advanced and emerging AI technologies to gain insights and make decisions based on massive amounts of data. Ray (https://github.com/ray-project/ray) is a fast and simple framework open-sourced by UC Berkeley RISELab particularly designed for easily building advanced AI applications in a distributed fashion.
BKK16-408B Data Analytics and Machine Learning From Node to Cluster (Linaro)
Linaro is building an OpenStack based Developer Cloud. Here we present what was required to bring OpenStack to 64-bit ARM, the pitfalls, successes and lessons learnt; what’s missing and what’s next.
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms (DataStax Academy)
Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high performance big data applications today.
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin (Alex Zeltov)
This workshop will provide an introduction to Big Data Analytics using Apache Spark and Apache Zeppelin.
https://github.com/zeltovhorton/intro_spark_zeppelin_meetup
There will be a short lecture that includes an introduction to Spark and its components.
Spark is a unified framework for big data analytics. Spark provides one integrated API for use by developers, data scientists, and analysts to perform diverse tasks that would have previously required separate processing engines such as batch analytics, stream processing and statistical modeling. Spark supports a wide range of popular languages including Python, R, Scala, SQL, and Java. Spark can read from diverse data sources and scale to thousands of nodes.
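Spark's single API for these diverse tasks rests on a small set of data-parallel primitives; the canonical example is word count via map and reduceByKey. A pure-Python sketch of what those primitives compute (in real Spark each phase runs across partitions on the cluster, with a shuffle before the reduce):

```python
from collections import defaultdict

def map_phase(lines):
    """flatMap + map: emit (word, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_by_key(pairs):
    """Sum values per key, as Spark's reduceByKey does after a shuffle."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

lines = ["big data", "big spark"]
print(reduce_by_key(map_phase(lines)))  # {'big': 2, 'data': 1, 'spark': 1}
```

The PySpark equivalent is a one-liner over an RDD of lines, which is precisely why Spark can subsume batch, streaming, and SQL workloads under one programming model.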
The lecture will be followed by a demo. There will be a short lecture on Hadoop and how Spark and Hadoop interact and complement each other. You will learn how to move data into HDFS using Spark APIs, create a Hive table, explore the data with Spark and SQL, transform the data, and then issue some SQL queries. We will be using Scala and/or PySpark for the labs.
Unified Big Data Processing with Apache Spark (C4Media)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
The world has changed, and having one huge server won’t do the job anymore; when you’re talking about vast amounts of data that keep growing, the ability to scale out will be your savior. Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing.
This lecture will be about the basics of Apache Spark and distributed computing and the development tools needed to have a functional environment.
In the past, emerging technologies took years to mature. In the case of big data, while effective tools are still emerging, the analytics requirements are changing rapidly resulting in businesses to either make it or be left behind
Demi Ben Ari - Apache Spark 101 - First Steps into distributed computing:
The world has changed, having one huge server won’t do the job, the ability to Scale Out would be your savior. Apache Spark is a fast and general engine for big data processing, with streaming, SQL, machine learning and graph processing. Showing the basics of Apache Spark and distributed computing.
Demi is a Software engineer, Entrepreneur and an International Tech Speaker.
Demi has over 10 years of experience in building various systems both from the field of near real time applications and Big Data distributed systems.
Co-Founder of the “Big Things” Big Data community and Google Developer Group Cloud.
Big Data Expert, but interested in all kinds of technologies, from front-end to backend, whatever moves data around.
1. Big Data @ LIST
- Overview -
Dr. Francesco Bongiovanni
2. Credentials
Dr. Francesco Bongiovanni
B.Sc. in Computer Systems from Haute Ecole de Namur (Belgium)
ECTS certificate from Kemi-Tornio Univ. of Applied Sciences (Finland)
M.Sc. in Software Engineering of Distributed Systems, KTH - Royal Institute of Technology (Sweden)
Ph.D. in Computer Science from INRIA Sophia-Antipolis and Univ. of Nice Sophia Antipolis (France) - Oasis team (now Scale)
Post-Doc @ INRIA Grenoble and Joseph Fourier University (UJF) - Erods team (formerly Sardes)
Post-Doc @ Verimag Laboratory (CNRS) - Synchrone team
Expertise
● Scalable distributed systems & algorithms
○ P2P systems (Structured/Unstructured)
● Cloud computing
● Applied formal methods (Isabelle Theorem Prover / TLA+)
● Scalable simulations
● Distributed optimizations
● Cluster computing
francesco.bongiovanni@list.lu (Scholar / homepage / LinkedIn)
3. Forewords and disclaimer
● The following frameworks and tools are all open source projects (a big chunk comes from the Apache Software Foundation).
● This presentation is just a glimpse of what is available as of today.
● Full-blown details are intentionally omitted (configurations, setups, debugging bindings/frameworks, advanced examples... which took months to get right).
As a user of this software stack, you are mainly concerned about two things:
1. Programmability: does this stack provide me with the necessary programming abstractions/tools for expressing my problems?
2. Fault tolerance: if I run some tasks and a process or computer fails, does that affect my computation(s)?
→ Lots of abstractions (for computations on big graphs, iterative algorithms on large datasets,...) are provided through the various distributed frameworks.
→ Fault tolerance is baked into most of the presented frameworks.
4. Scientific processing
Scientists want to spend their time exploring, not coding nor waiting for their computations to be done.
[Workflow diagram: Idea → Write code → Run code → Study results → Publish paper. Running and waiting is unproductive time, and the code is often never run again.]
5. The free lunch is over (and has been since 2005)!
● The number of cores is now typically 2-12 *
● Moore’s law continues
○ the number of transistors doubles every 18 months
Holy Grail: develop software whose performance scales with the number of cores
=> Parallelize or perish...
* Intel announced on 16/02/2015 that an 18-core Xeon chip would be available before the summer
13. eScience Cluster (for prototyping purposes)
Mission
● Provide the team with Big Data programming
capabilities
● Program the cluster as if it was one (big) computer
○ Reliable & Scalable Programming Abstractions
● Co-locate various distributed frameworks efficiently
14. Hardware specs:
● 3 * 2 X QuadCore E5440 @ 2.83 GHz, 40 GB RAM
● 1 * 2 X QuadCore (dualCore) E5550 @ 2.67 GHz, 48 GB RAM
=> Total: 40 CPUs, 168 GB RAM, +/- 1 TB storage, GLAN
Apache Mesos
● Resource-efficient scheduler
● Reliable & scalable
● Powered by Mesos: Hadoop, Chronos, Marathon, Storm, Spark, Aurora, Jenkins,…
Programming capabilities:
● Streaming: able to process data streams (news feeds, Twitter feeds,...)
● Iterative: able to program iterative algorithms (PageRank, K-means,...)
● Interactive: able to interactively query large volumes of data
● …
16. HDFS, v 1.2.1
● Hadoop Distributed File System -> scalable, reliable file system through replication
○ note: a file is immutable once stored on HDFS
● +/- 1 TB of available storage for our datasets (wiki dumps, geographical datasets,...)
● Used by Spark and other frameworks to store/retrieve datasets
[Stack diagram, detailed overview: HDFS at the bottom, Mesos above it, then Spark with Spark Streaming (real-time), Spark SQL (SQL), GraphX (graph) and MLlib (machine learning); alongside: Cray Chapel, ZooKeeper, SparkR, H2O, ...]
17. Mesos, v 0.25.0
● Distributed OS scheduler & task executor
● Leverages Linux cgroups, no virtualization required
● Implements the Dominant Resource Fairness algorithm for fair task allocation
● See the data center as one big computer
● Uses Apache ZooKeeper for master fail-over (distributed coordination)
● !! Common resource layer over which diverse frameworks can run !!
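The Dominant Resource Fairness idea, in rough terms: repeatedly hand the next task to the framework whose dominant share (its largest fraction of any single cluster resource) is currently smallest. A minimal plain-Python sketch of that allocation loop follows; it is not Mesos code, and the capacity/demand numbers are the illustrative example from the DRF paper, not this cluster's:

```python
# Minimal sketch of Dominant Resource Fairness (DRF) allocation.
# Not Mesos code; capacities and demands are illustrative only.

def drf_allocate(capacity, demands, max_rounds=100):
    """Repeatedly give a task to the user with the lowest dominant share."""
    usage = {user: [0.0] * len(capacity) for user in demands}
    total = [0.0] * len(capacity)
    tasks = {user: 0 for user in demands}
    for _ in range(max_rounds):
        # Dominant share = max over resources of (used / capacity).
        def dominant(user):
            return max(u / c for u, c in zip(usage[user], capacity))
        # Only users whose next task still fits in the cluster are eligible.
        eligible = [u for u in demands
                    if all(t + n <= c
                           for t, n, c in zip(total, demands[u], capacity))]
        if not eligible:
            break
        user = min(eligible, key=dominant)
        usage[user] = [u + n for u, n in zip(usage[user], demands[user])]
        total = [t + n for t, n in zip(total, demands[user])]
        tasks[user] += 1
    return tasks

# Example from the DRF paper: 9 CPUs and 18 GB RAM; user A wants
# <1 CPU, 4 GB> per task, user B wants <3 CPUs, 1 GB> per task.
print(drf_allocate([9.0, 18.0], {"A": [1.0, 4.0], "B": [3.0, 1.0]}))
```

On this input the loop ends up giving A three tasks and B two, which equalizes their dominant shares, matching the worked example in the DRF paper.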
18. Spark, v 1.5.1 *
● Fast and general engine for large-scale data processing
● Leverages in-memory computing
● Apps can be written in Java / Python / Scala
● Combines SQL, streaming and complex analytics
● Can read from HDFS, Cassandra, HBase, S3…
● Integrates with YARN and Mesos
* discovered a blocking bug (MAJOR) in the previous release (SPARK-1052) and helped the community narrow it down
19. Spark embeds specialized frameworks:
● Spark Streaming: leverages Spark to perform computations on streaming data (Twitter feeds, raw network greps, …)
● Spark SQL: large-scale data warehouse system for Spark that can execute HiveQL queries up to 100x faster than Hive
● GraphX: library for large-scale graph processing. PageRank, for instance, can easily be implemented using GraphX (< 40 LoC)
● MLlib: Spark implementation of some common machine learning (ML) functionality (linear regression, binary classification, clustering,...)
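As a point of comparison for the "< 40 LoC" GraphX claim, the core PageRank iteration is short in any language. A plain-Python sketch on a toy adjacency list (this is not the GraphX API, just the algorithm it distributes):

```python
# PageRank power iteration on an adjacency list (plain Python, not
# GraphX): each node splits its rank evenly among its out-links.

def pagerank(links, iters=20, d=0.85):
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        contrib = {n: 0.0 for n in nodes}
        for n, outs in links.items():
            for m in outs:
                contrib[m] += rank[n] / len(outs)
        # Damping: mix a uniform jump with the received contributions.
        rank = {n: (1 - d) / len(nodes) + d * contrib[n] for n in nodes}
    return rank

# Toy graph: b and c both link to a, a links back to b.
ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
print(ranks)
```

On this toy graph, a ends up ranked highest (it receives two in-links) and c lowest (it receives none); the ranks sum to 1.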
20. SparkR, v 0.1
- R front-end for Spark (linked against Spark 1.1.0)
- Leverages the Spark engine for distributed/parallel computations
- Allows R functions to run against larger datasets
As of Spark 1.4, SparkR has been integrated into Spark.
21. - Emerging open source parallel programming language from Cray Inc.
“Chapel's primary goal is to make parallel programming far more productive, from multicore desktops and laptops to commodity clusters and the cloud to high-end supercomputers”
- Thanks to (working*) bindings in Go, Chapel programs can run on top of Mesos
- Partitioned Global Address Space (PGAS) model, i.e. threads/processes/tasks share one big address space across nodes
* Found buggy bindings online, managed to make them work (https://github.com/francesco-bongiovanni/mesos-chapel)
Chapel, v 1.9 (http://chapel.cray.com/index.html)
23. Other frameworks can be built and run on top of Mesos:
Aurora & Marathon: Mesos frameworks for long-running apps (Unix equivalent of “init.d”)
Exelixi: distributed framework for running genetic algorithms at scale
Chronos: distributed fault-tolerant scheduler supporting complex job topologies (Unix equivalent of “cron”)
Storm: distributed real-time computation system (similar to Spark Streaming)
Kafka: high-throughput distributed messaging system
MPI: message-passing system
...port your own…
Distributed frameworks can be written in JVM-based languages (Scala, Clojure, Java, …), Python, C++, Go, Erlang
25. Available Tools and PLs
- an HDFS cluster (with fault tolerance and replication enabled) to store datasets
- Applications using Spark can be programmed in Java, Scala or Python
- For building distributed frameworks on top of Mesos:
- Mesos officially supports C, C++, Java/Scala and Python
- third-party bindings in Go, Erlang, Haskell,...
- IPython Notebook working with Spark (http://ipython.org/notebook.html)
- Jupyter notebook with the Spark kernel is also available (http://www.jupyter.org)
- RStudio bundled with the SparkR package
- there is also an RStudio Server installed on the cluster
- Compilers/VMs within the cluster (and on the coming VM):
- Go 1.3.1, R 3.1.1, GCC 4.8.2 (C++11 fully supported), JDK 1.7.0_65, Scala 2.10.3, Python 2.7.6, Chapel 1.9 with the GASNet network library (high-performance networking), Erlang 17.3
27. Results from the trenches

Class of problem        | Use case                                                                          | Before        | After               | Improvement
Statistics              | Averaging 1 billion double values                                                 | 15-20 minutes | 10 seconds          | 100x speedup
Visualization algorithm | Weighted Maps algorithm                                                           | 36K elements  | 2.4M elements       | 66.6x more elements
Clustering algorithm    | Pearson correlation coefficient on (31K rows x 10 cols) full matrix: 991M tuples  | NA            | 6.9 min             | making it possible
Clustering algorithm    | K-means on forest data type (508K rows x 55 cols)                                 | NA            | 43.5 secs (average) | making it possible
Clustering algorithm    | CoAbundance clustering on 31K rows x 7 cols                                       | 5 minutes     | 20 seconds          | 15x speedup
28. Example #1 - Averaging 1 billion elements
Problem: a 17 GB text file with 10^9 double values.
GOAL: compute the average of these elements.
Steps: parseToDouble → sumAllElements → divideBy10^9
30. Example #1 - Averaging 1 billion elements
Back-of-the-envelope calculations: reading a 17 GB file on a single computer
reading 1 MB -> 30 ms
reading 17 GB -> 30 * 17,000 = 510,000 ms = 510 secs = 8 minutes 30 secs
31. Example #1 - Averaging 1 billion elements
Back-of-the-envelope calculations: reading a 17 GB file from a 4-node HDFS cluster
Reading 1 MB from the network -> 10 ms
Reading 1 MB from local disk -> 30 ms
Reading 17 GB from the cluster -> 40 * (17,000 / 4) = 170,000 ms = 2 min 50 secs
32. Example #1 - Averaging 1 billion elements
Back-of-the-envelope calculations:
Reading a 17 GB file on 1 node: 8 min 30 secs
Reading a 17 GB file from a 4-node cluster: 2 min 50 secs
That’s a 3x speed improvement...
33. Example #1 - Averaging 1 billion elements
Back-of-the-envelope calculations: reading a 17 GB file from a 4-node HDFS cluster IN MEMORY
Reading 1 MB from the network -> 10 ms
Reading 1 MB from local memory -> 250 µs = 0.25 ms
Reading 17 GB from the cluster -> 10.25 * (17,000 / 4) = 43,562 ms = 43.5 secs
34. Example #1 - Averaging 1 billion elements
Back-of-the-envelope calculations:
Reading a 17 GB file on 1 node: 8 min 30 secs
Reading a 17 GB file from a 4-node cluster IN MEMORY: 43.5 secs
That’s almost a 12x speed improvement...
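These estimates can be checked mechanically. A small Python script reproducing the arithmetic (the per-MB latencies are the slides' rough assumptions, not measurements):

```python
# Back-of-the-envelope check of the reading-time estimates above.
# Per-MB latencies are the slides' rough assumptions, not measurements.
SIZE_MB = 17 * 1000       # 17 GB expressed in MB (decimal, as on the slides)
DISK_MS_PER_MB = 30       # read 1 MB from local disk
NET_MS_PER_MB = 10        # read 1 MB over the network
MEM_MS_PER_MB = 0.25      # read 1 MB from local memory
NODES = 4

single_node = SIZE_MB * DISK_MS_PER_MB / 1000                           # seconds
cluster_disk = (SIZE_MB / NODES) * (NET_MS_PER_MB + DISK_MS_PER_MB) / 1000
cluster_mem = (SIZE_MB / NODES) * (NET_MS_PER_MB + MEM_MS_PER_MB) / 1000

print(single_node)                   # 510.0   (8 min 30 s)
print(cluster_disk)                  # 170.0   (2 min 50 s)
print(cluster_mem)                   # 43.5625 (about 43.5 s)
print(single_node / cluster_disk)    # 3.0     (the 3x improvement)
print(single_node / cluster_mem)     # ~11.7   (the "almost 12x" improvement)
```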
35. Example #1 - Averaging 10^9 elements
On a single machine: 15-20 min for read/parse/avg, for each operation (avg, sum of squares,...)
Using Spark on the cluster: 1st iteration: read/parse/persist in memory/avg; following iterations: < 10 secs for any similar operation
36. Example #1 - Averaging 10^9 elements

// read the file from HDFS
val rawDistData = sc.textFile("hdfs://10.10.0.141:9000/datasets/1B.txt")
// convert the values to Double
val distData: org.apache.spark.rdd.RDD[Double] = rawDistData.map(_.toDouble)
// put the data in cache, i.e. in memory
distData.cache
// compute the sum and the average
distData.map(x => x).reduce(_ + _) / 1000000000
// average of the sum of squares
distData.map(x => x * x).reduce(_ + _) / 1000000000
...
37. Example #1 - Averaging 1 billion elements
[Dataflow diagram, 17 GB file: read local chunks from HDFS → put data in memory → compute avg → send results back → aggregate result]
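The pattern in this dataflow (each node reduces its local chunk to a partial result, and the driver combines the partials) can be sketched in plain Python, with in-memory chunks standing in for HDFS blocks:

```python
# Each "node" reduces its local chunk to (sum, count); the driver then
# aggregates the partials. A plain-Python stand-in for the Spark dataflow.

def partial_sum(chunk):
    return sum(chunk), len(chunk)

def distributed_average(chunks):
    partials = [partial_sum(c) for c in chunks]   # would run on the workers
    total = sum(s for s, _ in partials)           # driver-side aggregation
    count = sum(n for _, n in partials)
    return total / count

chunks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0, 7.0]]
print(distributed_average(chunks))   # 4.0
```

Returning (sum, count) pairs rather than per-chunk averages is what makes the combination step correct for unevenly sized chunks.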
39. Example #2 - WeightedMaps
Weighted Maps: treemap visualization of geolocated quantitative data (by WP4 colleagues)
● French communes with 36K elements
○ Does it scale to a larger number of elements?
40. Example #2 - WeightedMaps
● Tested with 2 million elements
○ beyond 2×10^6 elements, bottlenecks occur in the spatial division (but may be due to lack of memory) -> under investigation
● No modifications to the WM source implementation
● `just` branch it with Spark
○ minor modifications for data-retrieval convenience
○ most of the time spent on tweaking Spark (data serialization, memory consumption, GC heap space...)
note: the data makes no sense, a union of unions of French communes + US counties,...
41. Example #2 - WeightedMaps
● Tested with 2.4 million elements
○ beyond 2.4×10^6 elements, bottlenecks occur in the spatial division (but may be due to lack of memory) -> under investigation
● No modifications to the WM source implementation
● `just` branch it with Spark
○ minor modifications for data-retrieval convenience
○ most of the time spent on tweaking Spark (data serialization, memory consumption, GC heap space...)
note: the data makes no sense, a union of unions of French communes + US counties,...
42. Example #2 - WeightedMaps
● < 150 LoC: from parsing to painting!
○ time perf. for 2.4×10^6 elements: +/- 145 secs
● < 18 secs is spent on the Spark side (no optimization whatsoever)
Next steps in the pipeline:
● Mine the GeoNames dataset with Spark
○ extract meaningful data
● Spark implementation of WeightedMap
○ scale WM to the next level
○ rewrite the WM source implementation to leverage Spark data structures (RDDs)
43. Example #3 - Computing Pearson’s correlation coefficient on EVA contigs
Data: EVA contigs, 31K rows by 10 columns
Pearson correlation done on a full matrix, i.e. 31K * 31K = 991,179,289 vector tuples (991 million)
`Naive` implementation took 6.9 minutes
Leverages the ScalaNLP library
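For reference, Pearson's correlation coefficient for two vectors is their covariance divided by the product of their standard deviations. A plain-Python version of the per-pair computation (the slide's implementation uses the ScalaNLP library, not this code):

```python
import math

# Pearson correlation coefficient between two equal-length vectors:
# cov(x, y) / (std(x) * std(y)), always in [-1, 1].

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson([1, 2, 3], [2, 4, 6]))   # 1.0 (perfectly correlated)
print(pearson([1, 2, 3], [3, 2, 1]))   # -1.0 (perfectly anti-correlated)
```

The 991M-tuple figure on the slide comes from evaluating this pairwise function over every (row, row) combination of the 31K-row matrix.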
44. Example #4 - K-Means Clustering on Spark
Data: forest dataset (Covertype Data Set), 581,012 rows by 55 columns
Parsing the data lines into a proper Vector using Scala/Spark
45. Example #4 - K-Means Clustering on Spark
Data: forest dataset (Covertype Data Set), 581,012 rows by 54 columns
K = 7, max iterations = 100
Mean running time: +/- 43.65 secs
Implementation based on the k-means|| algorithm by Bahmani et al. (VLDB 2012)
→ When multiple concurrent runs are requested, they are executed together with joint passes over the data for efficiency
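The k-means|| algorithm referenced above concerns the initialization of the centers; the iteration that then runs is the classic Lloyd loop (assign each point to its nearest center, then move each center to the mean of its cluster). A plain-Python sketch on toy 1-D data, with fixed initial centers for determinism (this illustrates the algorithm, not MLlib's code):

```python
# Lloyd-style k-means iteration in plain Python (toy 1-D illustration
# of what MLlib distributes; MLlib adds k-means|| initialization on top).

def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster
        # (keep the old center if a cluster ends up empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

pts = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
print(sorted(kmeans(pts, [0.0, 10.0])))   # centers near 1.0 and 9.0
```

On a cluster, the assignment step is data-parallel over the points and only the per-cluster (sum, count) pairs are shipped back, which is why the algorithm maps well onto Spark.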
46. Ongoing work - CoAbundance clustering
Hydviga project: biogas production, contig binning
● Visual Analytics-based tool (Parallel Coordinates)
● Data clustering algorithm (in R): great for prototyping, not so great for handling large data sets...
Let’s rewrite it: formal specs + Spark-based implementation
47. Ongoing work - CoAbundance clustering
Using TLA+ and PlusCal:
- simulation
- model checking
=> helps you think above the code level
=> think of it as executable pseudo-code which can be checked
48. A kind reminder
● I am NOT a distributed/parallel programming fanatic
● YOU can do a lot with modern multicore machines, provided:
○ you know your way around your algorithms and concurrent programming
○ you have a good idea of how your machine works
=> Don’t come to me:
- if your data is too small
- if your data fits in your computer and it’s a one-time thing
=> Come to me:
- if your data does not fit into your memory/disk
- if your computations are really expensive and frequent
However, even for small data there are benefits to using Spark locally on your machine: behind the scenes there is the Actor model of computation, so you can use it for `simple` parallel programming.
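Even without Spark, the same split-and-aggregate pattern pays off on a single multicore machine. A sketch using only Python's standard library (the data and worker count are made up for illustration):

```python
from multiprocessing import Pool

# Sum chunks of a list in parallel on one multicore machine: the same
# partial-result pattern as the cluster examples, but in-process.

def chunk_sum(chunk):
    return sum(chunk)

def parallel_sum(data, workers=4):
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(workers) as pool:
        # Each worker process reduces one chunk; the parent aggregates.
        return sum(pool.map(chunk_sum, chunks))

if __name__ == "__main__":
    print(parallel_sum(list(range(100000))))   # 4999950000
```

For CPU-bound work this sidesteps Python's GIL by using processes rather than threads; for genuinely large data, the cluster frameworks above take over.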
[Abstraction stack: Problem → Algorithm → Program → Instruction Set Architecture (ISA) → Microarchitecture → Circuits → Electrons]
49. Some references
Foundational paper about Spark:
Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI 12). USENIX Association, 2012.
Foundational paper about Spark Streaming:
Zaharia, Matei, et al. "Discretized streams: Fault-tolerant streaming computation at scale." Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP 13). ACM, 2013.
Paper about Chapel:
Chamberlain, Bradford L., David Callahan, and Hans P. Zima. "Parallel programmability and the Chapel language." International Journal of High Performance Computing Applications 21.3 (2007): 291-312.
Paper about GraphX:
Gonzalez, Joseph E., et al. "GraphX: Graph processing in a distributed dataflow framework." Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). USENIX Association, 2014.
Paper about Spark SQL:
Xin, Reynold S., et al. "Shark: SQL and rich analytics at scale." Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 2013.
http://spark.apache.org
http://mesos.apache.org
http://chapel.cray.com