Spark for Large-Scale Data Analysis
@snithish_
@snithish
HELLO!
Developer @ Thoughtworks
Built and managed data lake for a retail enterprise
Avid learner of distributed systems
Co-maintainer of spark-fast-test
1. We can analyze data on a workstation
1.a. The memory conundrum
In the beginning
▪ A single CSV file
▪ Typically small in size
▪ We load this file in Python or R, creating a dataframe
▪ We manipulate this dataframe, then visualize or model (see the sketch below)
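A minimal sketch of this workstation workflow in pandas; the file name and the region/amount columns are hypothetical, and the plot assumes matplotlib is installed:

    import pandas as pd

    # Load the whole file into memory as a dataframe ("sales.csv" is hypothetical)
    df = pd.read_csv("sales.csv")

    # Manipulate in memory: filter rows, then aggregate
    summary = df[df["amount"] > 0].groupby("region")["amount"].sum()

    # Visualize (or hand off to a model instead)
    summary.plot(kind="bar")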
Zooming in on file load
▪ What happens when we instruct Python to open a file and read its contents?
Zooming in on file load
▪ The file is brought from disk into memory (mmap)
▪ Read from memory
▪ And written to the screen (see the mmap sketch below)
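A rough sketch of a memory-mapped read in Python, assuming a hypothetical file data.csv:

    import mmap

    with open("data.csv", "rb") as f:
        # Map the file's pages into memory; the OS pages bytes in from disk on demand
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            first_line = mm.readline()  # read from memory
            print(first_line)           # written to the screen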
Zooming in on file load
▪ What if the file size is larger than memory?
▪ The file is divided into chunks that are brought into memory and, once consumed, removed from memory (see the chunked-read sketch below)
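One way to express chunked consumption in pandas ("big.csv" and its amount column are hypothetical); each chunk is dropped once consumed, so only one chunk lives in memory at a time:

    import pandas as pd

    total = 0
    for chunk in pd.read_csv("big.csv", chunksize=1_000_000):
        total += chunk["amount"].sum()  # consume the chunk, then it is freed
    print(total)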
Thrashing
▪ Imagine running logistic regression on this large file
▪ For each iteration, the entire file would be brought into memory and removed again
▪ We call this thrashing in OS parlance (see the sketch below)
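A sketch of why iterative training thrashes, with a hypothetical big.csv and a stubbed-out update step; every epoch streams the entire file from disk again:

    import pandas as pd

    def update_model(chunk):
        # hypothetical per-chunk gradient update; a no-op stub here
        pass

    for epoch in range(100):                                   # each training iteration...
        for chunk in pd.read_csv("big.csv", chunksize=1_000_000):
            update_model(chunk)                                # ...re-reads the whole file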
Resource wastage
▪ Despite efficient scheduling, bringing data from persistent storage into memory is time-consuming
▪ If data is in a compressed format, we would also have to uncompress it
Latency numbers (Jonas Bonér)
Rule of Thumb
▪ Try to keep things in memory as much as possible
1.b. The parallelism conundrum
Parallelism vs Concurrency
▪ Python and R don’t support parallelism out of the box
▪ But Python and R are concurrent
Concurrency
▪ We have one knife
▪ The knife is shared by family members
▪ When one member is using the knife, the others must wait their turn (see the lock sketch below)
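The one-knife analogy maps to a lock shared by threads; a minimal sketch:

    import threading

    knife = threading.Lock()  # the single shared knife

    def family_member(name):
        with knife:           # wait for the knife, use it, hand it back
            print(f"{name} is using the knife")

    threads = [threading.Thread(target=family_member, args=(f"member-{i}",))
               for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()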
Parallelism
▪ There are 4 knives
▪ A family member can pick any free knife of the 4
▪ If all 4 are in use, they wait for one to become available
When concurrency meets multiple CPUs (cores)
▪ Managing ‘global’ state is hard in multi-CPU environments
▪ GIL -> Global Interpreter Lock
Processes
▪ In Python, spawning threads gives concurrency (the GIL lets only one thread execute Python code at a time)
▪ Spawning processes gives parallelism (see the comparison sketch below)
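A minimal sketch of the difference on a CPU-bound task; the thread pool is throttled by the GIL, while the process pool runs truly in parallel:

    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    def cpu_bound(n):
        return sum(i * i for i in range(n))  # pure-Python CPU work

    if __name__ == "__main__":
        work = [10_000_000] * 4
        with ThreadPoolExecutor(max_workers=4) as pool:   # concurrency: the GIL serializes these
            thread_results = list(pool.map(cpu_bound, work))
        with ProcessPoolExecutor(max_workers=4) as pool:  # parallelism: one interpreter (and GIL) per process
            process_results = list(pool.map(cpu_bound, work))
        print(thread_results == process_results)  # same answers; only the wall-clock time differs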
Spawning Processes
▪ Each iteration spawns a new process
▪ Each process tries to open and read the file
▪ Multiple copies of the file are kept in memory
Avoiding Multiple Copies
▪ Communication between processes (IPC):
▫ Shared memory
▫ Message queues
▪ Very difficult to handle (see the queue sketch below)
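A minimal sketch of one of these IPC options, a message queue: each worker sends only its small result back to the parent instead of sharing large state:

    from multiprocessing import Process, Queue

    def worker(q, doc):
        q.put((doc, len(doc)))  # send a small result over the queue

    if __name__ == "__main__":
        q = Queue()
        docs = ["first document", "second document"]  # hypothetical inputs
        procs = [Process(target=worker, args=(q, d)) for d in docs]
        for p in procs:
            p.start()
        results = [q.get() for _ in procs]
        for p in procs:
            p.join()
        print(results)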
2. Internet Scale
3Vs?
▪ Volume
▪ Velocity
▪ Variety
“The crawl archive for October 2018 is now available! It contains 3.0 billion web pages and 240 TiB of uncompressed content, crawled between October 15th and 24th.”
— Common Crawl
Scaling
▪ Vertical scaling
▪ Horizontal scaling
Vertical Scaling
▪ Keep upgrading a machine to accommodate large data
▪ AWS instances with 384 GB RAM and 96 cores
▪ There are limits on how far a machine can be upgraded
▪ High cost -> $6.048 / hour
Horizontal Scaling
▪ Scale by adding commodity hardware
▪ Marginally lower cost
▫ One 16-core, 32 GB RAM node costs $0.4608 / hour
▫ We would need 12 of these to match the vertical limits => 12 × $0.4608 = $5.5296 / hour
▫ No limit on how many nodes we can add
Horizontal Scaling
▪ Parallelism across nodes
▪ The complexity of managing them is compounded
▪ There is a need for easier semantics for working with distributed nodes
3. Map Reduce
A Programming Paradigm
▪ Map-Reduce, like OO, is a programming paradigm
▪ It is well suited for parallel, distributed data processing
▪ A central principle in Functional Programming
A Programming Paradigm
▪ Programs are written as a series of map and reduce phases (see the sketch below)
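In plain Python the same shape falls out of the built-in map and functools.reduce; a toy word count:

    from functools import reduce

    words = ["spark", "map", "reduce", "spark"]
    pairs = map(lambda w: (w, 1), words)  # map phase: transform each element independently
    counts = reduce(
        lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
        pairs,
        {},
    )                                     # reduce phase: combine the mapped results
    print(counts)  # {'spark': 2, 'map': 1, 'reduce': 1}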
Example
▪ Find how many times a link is referenced in other websites to compute its rank (PageRank)
Iterative Style - One Document at a time
▪ For each document, find all <a href> tags and extract them as a list
▪ Iterate through the list and look up the previous count in a map (see the sketch below)
▫ If a previous count is present, set newCount = previousCount + 1
▫ Else, newCount = 1
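A dict-based sketch of this iterative style, with a hypothetical extract_links helper:

    import re

    def extract_links(document):
        # hypothetical helper: pull href values out of <a> tags
        return re.findall(r'<a href="([^"]+)">', document)

    counts = {}
    document = '<a href="www.google.com">Google</a><a href="www.google.com">Google</a>'
    for link in extract_links(document):
        previous_count = counts.get(link)         # look up the previous count
        counts[link] = previous_count + 1 if previous_count is not None else 1
    print(counts)  # {'www.google.com': 2}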
Iterative Style - Multiple Documents at a time (Multiple processes)
▪ The previous strategy fails, as we would have to share the map across processes
▪ Locks have to be acquired, adding additional delays
Map Reduce - Multiple Documents at a time (Multiple processes)
▪ Each process works with its own map
▪ When a process is done, it emits its response
▪ Once all documents are mapped, we start combining the counts in each map; this is the reduce stage (a runnable sketch follows the worked example)
Map Reduce - Multiple Documents at a time (Multiple processes)
<h1>Document 1</h1>
<a href="www.google.com">Google</a>
<a href="www.ssn.edu.in">SSN</a>
<a href="www.yahoo.com">Yahoo</a>
<a href="www.ssn.edu.in">SSN</a>
<h1>Document 2</h1>
<a href="www.google.com">Google</a>
<a href="www.github.com">GitHub</a>
Map output:
Document 1: { www.google.com: 1, www.ssn.edu.in: 2, www.yahoo.com: 1 }
Document 2: { www.google.com: 1, www.github.com: 1 }
Map Reduce - Multiple Documents at a time (Multiple processes)
Map output (input to the reduce stage):
Document 1: { www.google.com: 1, www.ssn.edu.in: 2, www.yahoo.com: 1 }
Document 2: { www.google.com: 1, www.github.com: 1 }
Reduce output (counts merged across documents):
{ www.google.com: 2, www.ssn.edu.in: 2, www.yahoo.com: 1, www.github.com: 1 }
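A minimal multiprocessing sketch that reproduces these numbers: each process maps one document to its own counts, and the parent merges them in the reduce step (document strings taken from the slides, anchor text aside):

    import re
    from collections import Counter
    from multiprocessing import Pool

    DOC1 = ('<a href="www.google.com">Google</a>'
            '<a href="www.ssn.edu.in">SSN</a>'
            '<a href="www.yahoo.com">Yahoo</a>'
            '<a href="www.ssn.edu.in">SSN</a>')
    DOC2 = ('<a href="www.google.com">Google</a>'
            '<a href="www.github.com">GitHub</a>')

    def map_links(document):
        # map: each process builds its own per-document count
        return Counter(re.findall(r'<a href="([^"]+)">', document))

    if __name__ == "__main__":
        with Pool(2) as pool:
            partial_counts = pool.map(map_links, [DOC1, DOC2])
        # reduce: merge the per-process maps
        totals = sum(partial_counts, Counter())
        print(totals)  # google: 2, ssn.edu.in: 2, yahoo: 1, github: 1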
4. Let’s take Spark for a spin
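A minimal PySpark sketch of the same link count, assuming a hypothetical directory docs/ of HTML files with one tag per line:

    import re
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("link-count").getOrCreate()

    counts = (spark.sparkContext.textFile("docs/*.html")
              .flatMap(lambda line: re.findall(r'<a href="([^"]+)">', line))  # map phase
              .map(lambda link: (link, 1))
              .reduceByKey(lambda a, b: a + b))                               # reduce phase

    print(counts.collect())
    spark.stop()

Spark distributes the map work across nodes and, unlike the hand-rolled process spawning above, keeps intermediate data in memory across stages, which is exactly the rule of thumb from earlier.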
THANKS!
Any questions?
You can find me at:
@snithish_
nithishsankaranarayanan@gmail.com
