Spark for Large-Scale Data Analysis
@snithish_
@snithish
HELLO!
Developer @ Thoughtworks
Built and managed data lake for a retail enterprise
Avid learner of distributed systems
Co-maintainer of spark-fast-test
1. We can analyze data on a workstation
1.a. The memory conundrum
In the beginning
▪ A single CSV file
▪ Typically small in size
▪ We load this file in Python or R, creating a dataframe
▪ We manipulate this dataframe, then visualize or model (see the sketch below)
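A minimal sketch of this workstation workflow in pandas; the file name and the region/amount columns are hypothetical, and the plot assumes matplotlib is installed:

    import pandas as pd

    # Load the whole file into memory as a dataframe ("sales.csv" is hypothetical)
    df = pd.read_csv("sales.csv")

    # Manipulate in memory: filter rows, then aggregate
    summary = df[df["amount"] > 0].groupby("region")["amount"].sum()

    # Visualize (or hand off to a model instead)
    summary.plot(kind="bar")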
Zooming in on file load
▪ What happens when we instruct Python to open a file and read its contents?
Zooming in on file load
▪ The file is brought from disk into memory (mmap)
▪ Read from memory
▪ And written to the screen (see the mmap sketch below)
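A rough sketch of a memory-mapped read in Python, assuming a hypothetical file data.csv:

    import mmap

    with open("data.csv", "rb") as f:
        # Map the file's pages into memory; the OS pages bytes in from disk on demand
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            first_line = mm.readline()  # read from memory
            print(first_line)           # written to the screen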
Zooming in on file load
▪ What if the file size is larger than memory?
▪ The file is divided into chunks that are brought into memory and, once consumed, removed from memory (see the chunked-read sketch below)
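One way to express chunked consumption in pandas ("big.csv" and its amount column are hypothetical); each chunk is dropped once consumed, so only one chunk lives in memory at a time:

    import pandas as pd

    total = 0
    for chunk in pd.read_csv("big.csv", chunksize=1_000_000):
        total += chunk["amount"].sum()  # consume the chunk, then it is freed
    print(total)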
Thrashing
▪ Imagine running logistic regression on this large file
▪ For each iteration, the entire file would be brought into memory and removed again
▪ We call this thrashing in OS parlance (see the sketch below)
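A sketch of why iterative training thrashes, with a hypothetical big.csv and a stubbed-out update step; every epoch streams the entire file from disk again:

    import pandas as pd

    def update_model(chunk):
        # hypothetical per-chunk gradient update; a no-op stub here
        pass

    for epoch in range(100):                                   # each training iteration...
        for chunk in pd.read_csv("big.csv", chunksize=1_000_000):
            update_model(chunk)                                # ...re-reads the whole file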
Resource wastage
▪ Despite efficient scheduling, bringing data from persistent storage into memory is time-consuming
▪ If data is in a compressed format, we would also have to uncompress it
Latency numbers (Jonas Bonér)
Rule of Thumb
▪ Try to keep things in memory as much as possible
1.b. The parallelism conundrum
Parallelism vs Concurrency
▪ Python and R don’t support parallelism out of the box
▪ But Python and R are concurrent
Concurrency
▪ We have one knife
▪ The knife is shared by family members
▪ When one member is using the knife, the others must wait their turn (see the lock sketch below)
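The one-knife analogy maps to a lock shared by threads; a minimal sketch:

    import threading

    knife = threading.Lock()  # the single shared knife

    def family_member(name):
        with knife:           # wait for the knife, use it, hand it back
            print(f"{name} is using the knife")

    threads = [threading.Thread(target=family_member, args=(f"member-{i}",))
               for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()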
Parallelism
▪ There are 4 knives
▪ A family member can pick any free knife of the 4
▪ If all 4 are in use, they wait for one to become available
When concurrency meets multiple CPUs (cores)
▪ Managing ‘global’ state is hard in multi-CPU environments
▪ GIL -> Global Interpreter Lock
Processes
▪ In Python, spawning threads gives concurrency (the GIL lets only one thread execute Python code at a time)
▪ Spawning processes gives parallelism (see the comparison sketch below)
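A minimal sketch of the difference on a CPU-bound task; the thread pool is throttled by the GIL, while the process pool runs truly in parallel:

    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    def cpu_bound(n):
        return sum(i * i for i in range(n))  # pure-Python CPU work

    if __name__ == "__main__":
        work = [10_000_000] * 4
        with ThreadPoolExecutor(max_workers=4) as pool:   # concurrency: the GIL serializes these
            thread_results = list(pool.map(cpu_bound, work))
        with ProcessPoolExecutor(max_workers=4) as pool:  # parallelism: one interpreter (and GIL) per process
            process_results = list(pool.map(cpu_bound, work))
        print(thread_results == process_results)  # same answers; only the wall-clock time differs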
Spawning Processes
▪ Each iteration spawns a new process
▪ Each process tries to open and read the file
▪ Multiple copies of the file are kept in memory
Avoiding Multiple Copies
▪ Communication between processes (IPC):
▫ Shared memory
▫ Message queues
▪ Very difficult to handle (see the queue sketch below)
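A minimal sketch of one of these IPC options, a message queue: each worker sends only its small result back to the parent instead of sharing large state:

    from multiprocessing import Process, Queue

    def worker(q, doc):
        q.put((doc, len(doc)))  # send a small result over the queue

    if __name__ == "__main__":
        q = Queue()
        docs = ["first document", "second document"]  # hypothetical inputs
        procs = [Process(target=worker, args=(q, d)) for d in docs]
        for p in procs:
            p.start()
        results = [q.get() for _ in procs]
        for p in procs:
            p.join()
        print(results)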
2. Internet Scale
3Vs?
▪ Volume
▪ Velocity
▪ Variety
“The crawl archive for October 2018 is now available! It contains 3.0 billion web pages and 240 TiB of uncompressed content, crawled between October 15th and 24th.”
— Common Crawl
Scaling
▪ Vertical scaling
▪ Horizontal scaling
Vertical Scaling
▪ Keep upgrading a machine to accommodate large data
▪ AWS instances with 384 GB RAM and 96 cores
▪ There are limits on how far a machine can be upgraded
▪ High cost -> $6.048 / hour
Horizontal Scaling
▪ Scale by adding commodity hardware
▪ Marginally lower cost
▫ One 16-core, 32 GB RAM node costs $0.4608 / hour
▫ We would need 12 of these to match the vertical limits => 12 × $0.4608 = $5.5296 / hour
▫ No limit on how many nodes we can add
Horizontal Scaling
▪ Parallelism across nodes
▪ The complexity of managing them is compounded
▪ There is a need for easier semantics for working with distributed nodes
3. Map Reduce
A Programming Paradigm
▪ Map-Reduce, like OO, is a programming paradigm
▪ It is well suited for parallel, distributed data processing
▪ A central principle in Functional Programming
A Programming Paradigm
▪ Programs are written as a series of map and reduce phases (see the sketch below)
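In plain Python the same shape falls out of the built-in map and functools.reduce; a toy word count:

    from functools import reduce

    words = ["spark", "map", "reduce", "spark"]
    pairs = map(lambda w: (w, 1), words)  # map phase: transform each element independently
    counts = reduce(
        lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
        pairs,
        {},
    )                                     # reduce phase: combine the mapped results
    print(counts)  # {'spark': 2, 'map': 1, 'reduce': 1}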
Example
▪ Find how many times a link is referenced in other websites to compute its rank (PageRank)
Iterative Style - One Document at a time
▪ For each document, find all <a href> tags and extract them as a list
▪ Iterate through the list and look up the previous count in a map (see the sketch below)
▫ If a previous count is present, set newCount = previousCount + 1
▫ Else, newCount = 1
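A dict-based sketch of this iterative style, with a hypothetical extract_links helper:

    import re

    def extract_links(document):
        # hypothetical helper: pull href values out of <a> tags
        return re.findall(r'<a href="([^"]+)">', document)

    counts = {}
    document = '<a href="www.google.com">Google</a><a href="www.google.com">Google</a>'
    for link in extract_links(document):
        previous_count = counts.get(link)         # look up the previous count
        counts[link] = previous_count + 1 if previous_count is not None else 1
    print(counts)  # {'www.google.com': 2}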
Iterative Style - Multiple Documents at a time (Multiple processes)
▪ The previous strategy fails, as we would have to share the map across processes
▪ Locks have to be acquired, adding additional delays
Map Reduce - Multiple Documents at a time (Multiple processes)
▪ Each process works with its own map
▪ When a process is done, it emits its response
▪ Once all documents are mapped, we start combining the counts in each map; this is the reduce stage (a runnable sketch follows the worked example)
Map Reduce - Multiple Documents at a time (Multiple processes)
<h1>Document 1</h1>
<a href="www.google.com">Google</a>
<a href="www.ssn.edu.in">SSN</a>
<a href="www.yahoo.com">Yahoo</a>
<a href="www.ssn.edu.in">SSN</a>
<h1>Document 2</h1>
<a href="www.google.com">Google</a>
<a href="www.github.com">GitHub</a>
Map output:
Document 1: { www.google.com: 1, www.ssn.edu.in: 2, www.yahoo.com: 1 }
Document 2: { www.google.com: 1, www.github.com: 1 }
Map Reduce - Multiple Documents at a time (Multiple processes)
Map output (input to the reduce stage):
Document 1: { www.google.com: 1, www.ssn.edu.in: 2, www.yahoo.com: 1 }
Document 2: { www.google.com: 1, www.github.com: 1 }
Reduce output (counts merged across documents):
{ www.google.com: 2, www.ssn.edu.in: 2, www.yahoo.com: 1, www.github.com: 1 }
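A minimal multiprocessing sketch that reproduces these numbers: each process maps one document to its own counts, and the parent merges them in the reduce step (document strings taken from the slides, anchor text aside):

    import re
    from collections import Counter
    from multiprocessing import Pool

    DOC1 = ('<a href="www.google.com">Google</a>'
            '<a href="www.ssn.edu.in">SSN</a>'
            '<a href="www.yahoo.com">Yahoo</a>'
            '<a href="www.ssn.edu.in">SSN</a>')
    DOC2 = ('<a href="www.google.com">Google</a>'
            '<a href="www.github.com">GitHub</a>')

    def map_links(document):
        # map: each process builds its own per-document count
        return Counter(re.findall(r'<a href="([^"]+)">', document))

    if __name__ == "__main__":
        with Pool(2) as pool:
            partial_counts = pool.map(map_links, [DOC1, DOC2])
        # reduce: merge the per-process maps
        totals = sum(partial_counts, Counter())
        print(totals)  # google: 2, ssn.edu.in: 2, yahoo: 1, github: 1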
4. Let’s take Spark for a spin
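A minimal PySpark sketch of the same link count, assuming a hypothetical directory docs/ of HTML files with one tag per line:

    import re
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("link-count").getOrCreate()

    counts = (spark.sparkContext.textFile("docs/*.html")
              .flatMap(lambda line: re.findall(r'<a href="([^"]+)">', line))  # map phase
              .map(lambda link: (link, 1))
              .reduceByKey(lambda a, b: a + b))                               # reduce phase

    print(counts.collect())
    spark.stop()

Spark distributes the map work across nodes and, unlike the hand-rolled process spawning above, keeps intermediate data in memory across stages, which is exactly the rule of thumb from earlier.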
THANKS!
Any questions?
You can find me at:
@snithish_
nithishsankaranarayanan@gmail.com
