Spark for Large Scale Data Analysis
@snithish_
@snithish
HELLO!
Developer @ Thoughtworks
Built and managed data lake for a retail enterprise
Avid learner of distributed systems
Co-maintainer of spark-fast-test
2
1. We can analyze data on a workstation
1.a. The memory conundrum
▪ A single CSV file
▪ Typically small in size
▪ We load this file in Python or R, creating a dataframe
▪ Manipulate this dataframe, then visualize or model it
In the beginning
5
▪ What happens when we instruct Python to open a file and read its contents?
Zooming in on file load
6
▪ File is brought from disk into memory (e.g. via mmap)
▪ Read from memory
▪ And written to screen
Zooming in on file load
7
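The three steps on this slide can be sketched in Python with the standard `mmap` module. This is only an illustration: the helper name `read_via_mmap` and the sample file are assumptions made for the example, and a plain `open(...).read()` does not necessarily use mmap under the hood.

```python
import mmap
import os
import tempfile


def read_via_mmap(path):
    """Map a file into memory, read it from memory, and print it."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            text = mapped[:].decode("utf-8")  # read from memory
    print(text, end="")  # written to screen
    return text


# Hypothetical sample file, created only for this illustration.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("id,price\n1,9.5\n2,12.0\n")
    sample_path = tmp.name

contents = read_via_mmap(sample_path)
os.remove(sample_path)
```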
▪ What if the file is larger than memory?
▪ The file is divided into chunks; each chunk is brought into memory and removed once it has been consumed
Zooming in on file load
8
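A minimal sketch of chunked reading, assuming a fixed chunk size chosen by the caller. The helper names `iter_chunks` and `count_lines` and the sample file are hypothetical; the point is that only one chunk is held in memory at a time.

```python
import os
import tempfile


def iter_chunks(path, chunk_size=1024 * 1024):
    """Yield the file's bytes one chunk at a time."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk  # the previous chunk can now be garbage collected


def count_lines(path, chunk_size=1024 * 1024):
    """Count newline characters without holding the whole file in memory."""
    return sum(chunk.count(b"\n") for chunk in iter_chunks(path, chunk_size))


# Hypothetical sample file: one header line plus 1000 data lines.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"id,price\n" + b"1,9.5\n" * 1000)
    sample_path = tmp.name

# A deliberately tiny chunk size so the file spans many chunks.
line_count = count_lines(sample_path, chunk_size=64)
os.remove(sample_path)
```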
Zooming in On file load
9
▪ Imagine running logistic regression on this large file
▪ On each iteration the entire file would be brought into memory and removed again
▪ We call this thrashing in OS parlance
Thrashing
10
▪ Despite efficient scheduling, bringing data from persistent storage into memory is time-consuming
▪ If the data is in a compressed format, we would also have to decompress it
Resource wastage
11
Latency numbers (Jonas Bonér)
12
▪ Try to keep things in memory as much as
possible
Rule of Thumb
13
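The rule of thumb can be illustrated with two ways of running an iterative computation over the same file. This is a sketch under simple assumptions: the data fits in memory, each "iteration" is just a sum, and the function names are hypothetical.

```python
import os
import tempfile


def read_values(path):
    """Read one float per line from disk."""
    with open(path) as f:
        return [float(line) for line in f]


def sums_rereading(path, iterations):
    """Go back to disk on every iteration (the wasteful pattern)."""
    return [sum(read_values(path)) for _ in range(iterations)]


def sums_in_memory(path, iterations):
    """Read from disk once, then reuse the in-memory list."""
    values = read_values(path)
    return [sum(values) for _ in range(iterations)]


# Hypothetical sample file, created only for this illustration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1.0\n2.0\n3.0\n")
    sample_path = tmp.name

reread_result = sums_rereading(sample_path, iterations=3)
cached_result = sums_in_memory(sample_path, iterations=3)
os.remove(sample_path)
```

Both functions return the same sums; the second touches persistent storage once instead of once per iteration.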
1.b. The parallelism conundrum
▪ Python and R don’t run code in parallel across cores out of the box
▪ But Python and R are concurrent
Parallelism vs Concurrency
15
▪ We have one knife
▪ The knife is shared by family members
▪ When one member is using the knife, the others must wait their turn
Concurrency
16
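The single shared knife behaves like a lock. A small sketch with Python threads, where the family member names and the bookkeeping variables are made up for the illustration:

```python
import threading

knife = threading.Lock()  # the one shared knife
in_use = 0                # how many members hold the knife right now
max_in_use = 0            # the most members that ever held it at once
served = []               # order in which members got their turn


def cook(member):
    global in_use, max_in_use
    with knife:  # wait for a turn with the knife
        in_use += 1
        max_in_use = max(max_in_use, in_use)
        served.append(member)
        in_use -= 1


members = ["mother", "father", "daughter", "son"]
threads = [threading.Thread(target=cook, args=(m,)) for m in members]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every member eventually gets a turn, but `max_in_use` never exceeds 1: the work is interleaved, not simultaneous.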
▪ There are 4 knives
▪ A family member can pick any free knife of the 4
▪ If all 4 are in use, they wait for one to become available
Parallelism
17
▪ Managing ‘global’ state is hard in multi-CPU environments
▪ GIL -> Global Interpreter Lock
Concurrency on multi-CPU (multi-core) machines
18
▪ In Python, spawning threads gives concurrency
▪ Spawning processes gives parallelism
Processes
19
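One way to see the difference is to write the work so that it does not care which kind of worker runs it. The sketch below uses `concurrent.futures` from the standard library; the function names and the chunking are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(numbers):
    """CPU-bound work on one chunk of data."""
    return sum(n * n for n in numbers)


def total_sum_of_squares(chunks, executor_cls, workers=4):
    """Run sum_of_squares on every chunk using the given executor class."""
    with executor_cls(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))


chunks = [range(0, 1000), range(1000, 2000), range(2000, 3000)]

# Threads: concurrent, but in CPython only one thread runs bytecode at a time.
threaded_total = total_sum_of_squares(chunks, ThreadPoolExecutor)
```

Passing `concurrent.futures.ProcessPoolExecutor` instead of `ThreadPoolExecutor` runs each chunk in a separate process, which is where real parallelism comes from. When doing so in a script, the call should sit under an `if __name__ == "__main__":` guard.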
▪ Each iteration spawns a new process
▪ Each process tries to open and read the file
▪ Multiple copies of the file are kept in memory
Spawning Process
20
▪ Communication between processes (IPC)
▪ Shared memory
▪ Message Queues
▪ Very difficult to handle
Avoiding Multiple Copies
21
2. Internet Scale
▪ Volume
▪ Velocity
▪ Variety
3V ?
23
“The crawl archive for October 2018 is now available! It contains 3.0 billion web pages and 240 TiB of uncompressed content, crawled between October 15th and 24th.”
24
25
Scaling
26
▪ Vertical scaling
▪ Horizontal scaling
Vertical Scaling
27
▪ Keep upgrading a single machine to accommodate larger data
▪ AWS instances with 384 GB RAM and 96 cores
▪ There are limits on how far a machine can be upgraded
▪ High cost -> $6.048 / hour
Horizontal Scaling
28
▪ Scale by adding commodity hardware
▪ Marginally lower cost
▫ One 16-core, 32 GB RAM node costs $0.4608 / hour
▫ We would need 12 of these to compete with the vertical limit => $5.5296 / hour
▫ No hard limit on how many nodes we can add
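The cost comparison can be checked with a few lines of arithmetic. The prices are the ones quoted on the slides; `Decimal` is used only to avoid floating-point rounding noise, and the variable names are made up for the example.

```python
from decimal import Decimal

vertical_cost = Decimal("6.048")   # $/hour for the single large instance
node_cost = Decimal("0.4608")      # $/hour for one 16-core, 32 GB node
node_count = 12

horizontal_cost = node_cost * node_count    # cost of the 12-node cluster
total_ram_gb = 32 * node_count              # combined RAM of the cluster
saving = vertical_cost - horizontal_cost    # hourly difference
saving_pct = saving / vertical_cost * 100   # difference as a percentage

print(f"cluster: ${horizontal_cost}/hour, {total_ram_gb} GB RAM")
print(f"saving:  ${saving}/hour ({saving_pct:.1f}%)")
```

The cluster matches the large instance's 384 GB of RAM while costing roughly 9% less per hour, which is why the slide calls the saving marginal.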
Horizontal Scaling
29
▪ Parallelism across nodes
▪ The complexity of managing them is compounded
▪ Need for easier semantics for working with
distributed nodes
3. Map Reduce
▪ Map-Reduce, like OO, is a programming paradigm
▪ It is well suited to parallel, distributed data processing
▪ Map and reduce are central ideas in functional programming
A Programming Paradigm
31
▪ Programs are written as a series of map and
reduce phases
A Programming Paradigm
32
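In its simplest form, a map phase followed by a reduce phase looks like this in plain Python. The input lines are made up for the illustration, and this runs on a single machine; it only shows the shape of the two phases.

```python
from functools import reduce

lines = ["spark for large", "scale data", "analysis"]

# Map phase: turn every input record into an intermediate value.
word_counts = list(map(lambda line: len(line.split()), lines))

# Reduce phase: fold the intermediate values into one result.
total_words = reduce(lambda a, b: a + b, word_counts, 0)
```

Each call in the map phase depends only on its own record, which is what makes the phase easy to spread across workers.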
▪ Find how many times a link is referenced by other websites to estimate its rank (a simplified take on PageRank)
Example
33
▪ For each document, find all <a href> tags and extract the links as a list
▪ Iterate through the list and look up each link’s previous count in a map
▫ If a previous count is present, set newCount = previousCount + 1
▫ Else, newCount = 1
Iterative Style - One Document at a Time
34
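The iterative style could be sketched as follows, using the standard library's `html.parser` to pull out the links. The class and function names, and the sample document, are assumptions made for this example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


def count_links(documents):
    """Process one document at a time, updating a single shared map."""
    counts = {}
    for html in documents:
        for link in extract_links(html):
            previous = counts.get(link)
            if previous is not None:
                counts[link] = previous + 1
            else:
                counts[link] = 1
    return counts


# Hypothetical input document.
sample = (
    '<a href="www.example.com">A</a>'
    '<a href="www.example.org">B</a>'
    '<a href="www.example.com">C</a>'
)
link_totals = count_links([sample])
```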
▪ The previous strategy breaks down, as we would have to share the map across processes
▪ Locks have to be acquired, adding delays
Iterative Style - Multiple Documents at a Time (Multiple Processes)
35
▪ Each process works with its own map
▪ When a process is done, it emits its result
▪ Once all documents are mapped, we combine the counts from every map; this is the reduce stage
Map Reduce - Multiple Documents at a Time (Multiple Processes)
36
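A minimal sketch of the same job in map-reduce style. To keep it short, each document is assumed to be an already extracted list of links, and the map step runs in a plain loop; because every `map_document` call uses only its own private map, the calls could be handed to separate processes without any locking. All names and sample data are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_document(links):
    """Map step: count links within one document, using a private map."""
    return Counter(links)


def reduce_counts(left, right):
    """Reduce step: merge two per-document maps by adding their counts."""
    return left + right


def link_counts(documents):
    partial_maps = [map_document(links) for links in documents]
    return reduce(reduce_counts, partial_maps, Counter())


# Hypothetical pre-extracted link lists, one per document.
documents = [
    ["a.example", "b.example", "a.example"],
    ["a.example", "c.example"],
]
totals = link_counts(documents)
```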
<h1>Document 1</h1>
<a href="www.google.com">Google</a>
<a href="www.ssn.edu.in">Google</a>
<a href="www.yahoo.com">Google</a>
<a href="www.ssn.edu.in">Google</a>
<h1>Document 2</h1>
<a href="www.google.com">Google</a>
<a href="www.github.com">Google</a>
Map output (Document 1)
www.google.com 1
www.ssn.edu.in 2
www.yahoo.com 1
Map output (Document 2)
www.google.com 1
www.github.com 1
Map Reduce - Multiple Documents at a Time (Multiple Processes)
37
Map Reduce - Multiple Documents at a Time (Multiple Processes)
38
Map output (Document 1)
www.google.com 1
www.ssn.edu.in 2
www.yahoo.com 1
Map output (Document 2)
www.google.com 1
www.github.com 1
Reduce output
www.google.com 2
www.ssn.edu.in 2
www.yahoo.com 1
www.github.com 1
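The reduce step for the two example documents can be reproduced with `collections.Counter`. The per-document counts below are taken from the links in Document 1 and Document 2; the variable names are made up for the example.

```python
from collections import Counter

# Map output: one private map per document.
document_1 = Counter({"www.google.com": 1, "www.ssn.edu.in": 2, "www.yahoo.com": 1})
document_2 = Counter({"www.google.com": 1, "www.github.com": 1})

# Reduce: add the counts for matching links.
reduced = document_1 + document_2

for link, count in reduced.items():
    print(link, count)
```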
4. Let’s take Spark for a spin
THANKS!
Any questions?
You can find me at:
40
@snithish_
nithishsankaranarayanan@gmail.com
