Lightning Fast Big Data Analytics using Apache Spark
----------------------------------------------------------------------------------

Hadoop gives you a great (actually revolutionary) mechanism for storing large datasets in a highly fault-tolerant and highly available storage system (HDFS), and the ability to process these mammoth datasets using its massively parallel & distributed processing framework (Map Reduce). It was built for batch processing, where analysts and programmers submit series of jobs to crunch very large structured/unstructured datasets and then wait for the results before performing further analysis. But one of the few things Hadoop is criticized for is its speed and its lack of interactivity (mainly because its user base has grown tremendously, and people always demand more, especially when it comes to speed).
Spark is an open source system that can run on top of your existing HDFS and can provide in-memory analytics up to 100x faster (almost interactive) than Map Reduce.
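
As a taste of where that speedup comes from, here is a hedged sketch (assuming a Spark shell with a SparkContext named sc; the HDFS path is hypothetical) of caching a working set once and re-querying it from memory:

    val logs = sc.textFile("hdfs:///data/pagecounts")
    val en = logs.filter(_.split(" ")(1) == "en") // lazy transformation
    en.cache()     // keep the filtered set in cluster memory
    en.count()     // first action: reads HDFS, fills the cache
    en.count()     // re-query: served from memory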

Topics that will be covered:

Quick Introduction to Hadoop & its Limitations
Introduction to Spark
Spark Architecture
Programming model of Spark
Demo
Spark Use Cases


Speaker Notes:
  • Solutions Architect at GlobalLogic. Has been working for the last 10 years on large databases, data warehouses, ETLs, data mining, and for around 2-3 years now on Big Data analytics, machine learning & distributed systems. GlobalLogic is a 6000+ headcount company in full product life-cycle services and one of the fastest-growing R&D services firms; it provides advisory, professional services, engineering and support services to 250+ customers globally. Will speak about an in-memory cluster-computing framework that can really nitrogen-boost your existing Hadoop-based Big Data setup for analytics.
  • Quickly touch upon Hadoop: what it does, HDFS, Map Reduce, and some of its limitations. Introduce Spark and one of the tools built on top of Spark, called Shark (the SQL interface to Spark). A little bit on Spark's architecture and its basic programming model. Showcase a demo of Spark's and Shark's functionality. Will speak a bit about the future of Spark, where it's heading, and about some of its existing customers and contributors.
  • Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure. "Large" basically means 10-100 GB and above. It is the driving force behind the big data industry's growth. Provides 2 basic components: HDFS (a large-scale storage system) and Map Reduce (a distributed cluster-computing framework). A typical Hadoop setup comprises: a cluster of a particular Hadoop distribution; tools like Hive, Pig and Mahout running on top of Hadoop (internally processing HDFS data using Map Reduce jobs); and a set of tools for importing/exporting data into HDFS from/to external systems like RDBMS or server logs.
  • One of the reasons Map Reduce is criticized is its restricted programming framework: MapReduce tasks must be written as acyclic dataflow programs (a stateless mapper followed by a stateless reducer, executed by a batch job scheduler), so repeated querying of datasets becomes difficult and iterative algorithms are hard to write. And after each Map-Reduce iteration, data has to be persisted to disk before the next iteration can proceed.
  • SparkContext: represents the connection to a Spark cluster and provides the entry point for interacting with Spark and distributing our jobs. Driver program: the process running the main() function of the application and creating the SparkContext. Cluster manager: an external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN). Worker node: any node that can run application code in the cluster. Executor: a process launched for an application on a worker node, which runs tasks and keeps data in memory or disk storage across them; each application has its own executors. Task: a unit of work that will be sent to one executor. Job: a parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs. Stage: each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
  • Resilient Distributed Datasets (RDDs) are the distributed memory abstraction that lets programmers perform in-memory parallel computations on large clusters, and that too in a highly fault-tolerant manner. This is the main concept around which the whole Spark framework revolves. Currently there are 2 types of RDDs (see the creation sketch after these notes): Parallelized collections, created by calling the parallelize method on an existing Scala collection; the developer can specify the number of slices to cut the dataset into, ideally 2-3 slices per CPU. Hadoop datasets, created from any file stored on HDFS or other storage systems supported by Hadoop (S3, HBase etc.) using SparkContext's textFile method; the default in this case is 1 slice per file block.
  • Transformations, like map, take an RDD as input, pass and process each element through a function, and return a new transformed RDD as output. By default, each transformed RDD is recomputed each time you run an action on it, unless you specify the RDD to be cached in memory, in which case Spark will try to keep the elements around the cluster for faster access. RDDs can be persisted on disk as well. Caching is the key tool for iterative algorithms. Using persist, one can specify the storage level for persisting an RDD; cache is just shorthand for the default storage level, which is MEMORY_ONLY (see the persistence sketch after these notes). MEMORY_ONLY: store the RDD as deserialized Java objects in the JVM; if the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed (this is the default level). MEMORY_AND_DISK: store the RDD as deserialized Java objects in the JVM; if the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed. MEMORY_ONLY_SER: store the RDD as serialized Java objects (one byte array per partition); generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read. MEMORY_AND_DISK_SER: similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed. DISK_ONLY: store the RDD partitions only on disk. MEMORY_ONLY_2, MEMORY_AND_DISK_2 etc.: same as the levels above, but replicate each partition on two cluster nodes. Which storage level is best? A few things to consider: try to keep as much in memory as possible; try not to spill to disk unless your computed datasets are expensive to recompute; use replication only if you want fault tolerance.
  • PageRank is an algorithm used by Google Search to rank websites in its search engine results. PageRank was named after Larry Page, one of the founders of Google. It is a way of measuring the importance of website pages. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is; the underlying assumption is that more important websites are likely to receive more links from other websites.
  • Spark Streaming: for stream processing. Continuously executes various parallel operations on an input stream of data: the system receives continuous data and divides it into batches, and each batch is treated and processed as an RDD (see the streaming sketch after these notes). GraphX: a distributed graph system, designed to efficiently execute graph algorithms using Spark's parallel and in-memory computation framework. MLbase: the goal of MLbase is to make distributed machine learning easy. BlinkDB: an approximate query engine that allows a trade-off between accuracy and response time, making very large datasets highly interactive; it is in the process of being deployed at Facebook, and AMPLab have demonstrated how complex queries on 17 TB of data (running on a 100-node cluster) can be completed in less than 2 seconds! You specify queries with a time bound, e.g. Select avg(SessionTime) from tblSession where UserGender='MALE' within 2 SECONDS.
  • Interpreter: it's actually the Scala command line (interpreter), modified for Spark. Hadoop I/O: for reading/writing from HDFS. Standalone: custom resource manager. Operators: map, join, group by etc. on RDDs. Networking: replication, caching, graph. Block manager: a very simple key-value store used as a cache. Broadcaster: sending/receiving events, heartbeats etc.
  • Used by the majority of Fortune 50 companies.
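
The RDD creation and persistence notes above, as a hedged sketch; a live SparkContext named sc is assumed, and the paths are hypothetical:

    import org.apache.spark.storage.StorageLevel

    // Two ways to create an RDD (note on RDD types above)
    val coll = sc.parallelize(1 to 1000000, 8)         // parallelized collection, 8 slices
    val text = sc.textFile("hdfs:///wiki/pagecounts")  // Hadoop dataset, 1 slice per block

    // Persistence (note on storage levels above)
    coll.cache()                                    // shorthand for persist(MEMORY_ONLY)
    text.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized; spills to disk if needed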
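And the micro-batch model from the Spark Streaming note, as a minimal sketch assuming the classic DStream API; the socket source and batch interval are illustrative:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))      // each 10s batch becomes an RDD
    val lines = ssc.socketTextStream("localhost", 9999)  // continuous input stream
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()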

Transcript:

1. Lightning Fast Big Data Analytics using Apache Spark. Manish Gupta, Solutions Architect – Product Engineering and Development. 30th Jan 2014, Delhi.
2. Agenda Of The Talk: Hadoop – A Quick Introduction; An Introduction To Spark & Shark; Spark – Architecture & Programming Model; Example & Demo; Spark Current Users & Roadmap.
3. Agenda Of The Talk (repeated; section divider).
4. What is Hadoop? It's open-source software for distributed storage of large datasets on commodity-class hardware in a highly fault-tolerant, scalable and flexible way (HDFS). It also provides a programming model/framework for processing these large datasets in a massively parallel, fault-tolerant and data-location-aware fashion (MR). (Diagram: Input fans out to Map tasks, which feed Reduce tasks, which produce Output.)
5. Limitations of Map Reduce. (Diagram: each iteration reads its input from HDFS and writes its output back to HDFS before the next iteration can start.)
   • Slow due to replication, serialization, and disk IO
   • Inefficient for: iterative algorithms (machine learning, graphs & network analysis) and interactive data mining (R, Excel, ad-hoc reporting, searching)
6. Approach: Leverage Memory?
   • Memory bus >> disk & SSDs
   • Many datasets fit into memory: 1 TB = 1 billion records @ 1 KB
   • Memory capacity also follows Moore's Law: a single 8 GB stick of RAM is about $80 right now; in 2021 you'd be able to buy a single 64 GB stick for the same price.
7. Agenda Of The Talk (repeated; section divider).
8. Agenda Of The Talk (repeated; section divider).
9. Spark: "A big data analytics cluster-computing framework written in Scala."
   • Open source; originally developed in the AMPLab at UC Berkeley.
   • Provides in-memory analytics, faster than Hadoop/Hive (up to 100x).
   • Designed for running iterative algorithms & interactive analytics.
   • Highly compatible with Hadoop's storage APIs; can run on your existing Hadoop cluster setup.
   • Developers can write driver programs using multiple programming languages.
10. Spark. (Architecture diagram: the Spark Driver (Master) talks to a Cluster Manager; Spark Workers each hold a cache and run alongside HDFS Datanodes holding data blocks.)
11. Spark. (Diagram: the Hadoop pattern again: iter. 1 reads the input from HDFS and writes to HDFS; iter. 2 reads from HDFS and writes to HDFS; and so on.)
12. Spark. (Diagram: one HDFS read of the input, then iter. 1, iter. 2, ... run on cached data.) Not tied to the 2-stage Map Reduce paradigm: 1. Extract a working set. 2. Cache it. 3. Query it repeatedly. (Chart: logistic regression in Hadoop and Spark.)
13. Spark. A simple analytical operation:
    1) pagecount = spark.textFile("/wiki/pagecounts")
       pagecount.count()
       Equivalent SQL: Select count(*) from pagecounts
    2) englishPages = pagecount.filter(_.split(" ")(1) == "en")
       englishPages.cache()
       englishPages.count()
       englishTuples = englishPages.map(line => line.split(" "))
       englishKeyValues = englishTuples.map(line => (line(0), line(3).toInt))
       englishKeyValues.reduceByKey(_+_, 1).collect
       Equivalent SQL: Select Col1, sum(Col4) from pagecounts Where Col2 = "en" Group by Col1
14. Shark
    • HIVE on SPARK = SHARK
    • A large-scale data warehouse system, just like Apache Hive.
    • Highly compatible with Hive (HQL, metastore, serialization formats, and UDFs).
    • Built on top of Spark (thus a faster execution engine).
    • Provision for creating in-memory materialized tables (cached tables).
    • Cached tables use columnar storage instead of row storage:
      Row storage:    (1, ABC, 4.1), (2, XYZ, 3.5), (3, PPP, 6.4)
      Column storage: (1, 2, 3), (ABC, XYZ, PPP), (4.1, 3.5, 6.4)
15. Shark. (Diagram: Hive architecture: Client (CLI, JDBC), Driver (Meta store, SQL Parser, Query Optimizer, Physical Plan Execution), executing via Map Reduce on HDFS.)
16. Shark. (Diagram: Shark architecture: Client (CLI, JDBC), Driver (Meta store, SQL Parser, Query Optimizer, Cache Mgr., Physical Plan Execution), executing via Spark on HDFS.)
17. Agenda Of The Talk (repeated; section divider).
18. Agenda Of The Talk (repeated; section divider).
19. Spark Programming Model. The user (developer) writes the Driver Program:
       sc = new SparkContext
       rDD = sc.textfile("hdfs://…")
       rDD.filter(…)
       rDD.Cache
       rDD.Count
       rDD.map
    (Diagram: the SparkContext talks to the Cluster Manager; Worker Nodes run Executors with caches, executing Tasks against HDFS Datanodes.)
20. Spark Programming Model. The driver program (as on the previous slide) writes to an RDD (Resilient Distributed Dataset):
    • Immutable data structure
    • In-memory (explicitly)
    • Fault tolerant
    • Parallel data structure
    • Controlled partitioning to optimize data placement
    • Can be manipulated using a rich set of operators.
    A minimal driver sketch follows.
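
A minimal sketch of this driver pattern, assuming Spark's classic RDD API; the app name, HDFS path, and filter string are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    object DriverSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("driver-sketch")
        val sc = new SparkContext(conf)              // entry point to the cluster
        val rdd = sc.textFile("hdfs:///data/input")  // RDD backed by HDFS blocks
        val errors = rdd.filter(_.contains("ERROR")) // transformation (lazy)
        errors.cache()                               // mark for in-memory reuse
        println(errors.count())                      // action: triggers execution
        sc.stop()
      }
    }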
21. RDD. Programming interface: the programmer can perform 3 types of operations (a sketch follows):
    • Transformations: create a new dataset from an existing one. Lazy in nature: they are executed only when some action is performed. Examples: map(func), filter(func), distinct().
    • Actions: return a value to the driver program, or export data to a storage system, after performing a computation. Examples: count(), reduce(func), collect(), take().
    • Persistence: for caching datasets in memory for future operations; option to store on disk or RAM or mixed (storage level). Examples: persist(), cache().
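
A hedged sketch of the three operation types, assuming a live SparkContext named sc; the data is made up for illustration:

    val nums = sc.parallelize(Seq(1, 2, 2, 3, 4))

    // Transformations: lazy, each returns a new RDD
    val doubled  = nums.map(_ * 2)
    val evens    = nums.filter(_ % 2 == 0)
    val distinct = nums.distinct()

    // Persistence: keep the computed partitions around for reuse
    distinct.cache() // shorthand for persist(StorageLevel.MEMORY_ONLY)

    // Actions: trigger execution and return values to the driver
    val n     = distinct.count()
    val total = distinct.reduce(_ + _)
    val local = distinct.collect()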
22. Spark. How Spark Works:
    • RDD: a parallel collection with partitions.
    • User applications create RDDs, transform them, and run actions.
    • This results in a DAG (Directed Acyclic Graph) of operators.
    • The DAG is compiled into stages; each stage is executed as a series of tasks (one task per partition).
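
One way to inspect this DAG, assuming the pagecounts pipeline built up on the following slides: RDD.toDebugString prints the operator lineage, indented at the shuffle boundaries where stages split:

    val pairs = sc.textFile("/wiki/pagecounts")
      .map(line => line.split("\t"))
      .map(r => (r(0), r(1).toInt))
      .reduceByKey(_ + _)
    println(pairs.toDebugString) // lineage: textFile -> map -> map -> reduceByKey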
23. Spark. Example:
       sc.textFile("/wiki/pagecounts")
    Lineage so far: textFile : RDD[String]
24. Spark. Example:
       sc.textFile("/wiki/pagecounts")
         .map(line => line.split("\t"))
    Lineage so far: textFile -> map : RDD[String] -> RDD[List[String]]
25. Spark. Example:
       sc.textFile("/wiki/pagecounts")
         .map(line => line.split("\t"))
         .map(R => (R(0), R(1).toInt))
    Lineage so far: textFile -> map -> map : ... -> RDD[(String, Int)]
26. Spark. Example:
       sc.textFile("/wiki/pagecounts")
         .map(line => line.split("\t"))
         .map(R => (R(0), R(1).toInt))
         .reduceByKey(_+_, 3)
    Lineage so far: textFile -> map -> map -> reduceByKey : ... -> RDD[(String, Int)]
27. Spark. Example:
       sc.textFile("/wiki/pagecounts")
         .map(line => line.split("\t"))
         .map(R => (R(0), R(1).toInt))
         .reduceByKey(_+_, 3)
         .collect()
    Lineage so far: textFile -> map -> map -> reduceByKey -> collect : ... -> Array[(String, Int)]
28. Spark. Execution Plan: textFile -> map -> map -> reduceByKey -> collect. The logical plan above gets compiled by the DAG scheduler into a plan comprising stages, as follows...
29. Spark. Execution Plan: the same operator graph split into Stage 1 (textFile, map, map) and Stage 2 (reduceByKey, collect). Stages are sequences of RDDs that don't have a shuffle in between.
30. Spark.
    Stage 1: 1. Read HDFS split. 2. Apply both the maps. 3. Start partial reduce. 4. Write shuffle data.
    Stage 2: 1. Read shuffle data. 2. Final reduce. 3. Send result to driver program.
31. Spark. Stage Execution (diagram: Stage 1 fans out into parallel tasks):
    • Create a task for each partition in the new RDD
    • Serialize the task
    • Schedule and ship tasks to slaves
    And all this happens internally (you don't need to do anything).
32. Spark. Task Execution: the task is the fundamental unit of execution in Spark. Over its lifetime a task fetches input (from HDFS or an RDD), executes, and writes output (to HDFS, an RDD, or intermediate shuffle output).
33. Spark. Spark Executor (Slaves). (Diagram: each executor core runs tasks back to back, each task going through Fetch Input, Execute Task, Write Output; cores 1-3 shown working in parallel.)
34. Spark. Summary of Components:
    • Task: the fundamental unit of execution in Spark
    • Stage: a set of tasks that run in parallel
    • DAG: logical graph of RDD operations
    • RDD: parallel dataset with partitions
35. Agenda Of The Talk (repeated; section divider).
36. Agenda Of The Talk (repeated; section divider).
37. Example & Demo. Cluster Details:
    • 6 m1.xlarge EC2 nodes
    • 1 machine is the master node; 5 are worker machines
    • 64-bit, 4 vCPU, 15 GB RAM each
38. Example & Demo. Dataset: Wiki Page View Stats; 20 GB of webpage view counts; 3 days' worth of data. Format: <date_time> <project_code> <page_title> <num_hits> <page_size>.
    Base RDD of all wiki pages:
       val allPages = sc.textFile("/wiki/pagecounts")
       allPages.take(10).foreach(println)
       allPages.count()
    Transformed RDD of all English pages (cached):
       val englishPages = allPages.filter(_.split(" ")(1) == "en")
       englishPages.cache()
       englishPages.count()
       englishPages.count()
39. Example & Demo. Dataset: Wiki Page View Stats (as above).
    Select date, sum(pageviews) from pagecounts group by date:
       englishPages.map(line => line.split(" ")).map(line => (line(0).substring(0, 8), line(3).toInt)).reduceByKey(_+_, 1).collect.foreach(println)
    Select date, count(distinct pageURL) from pagecounts group by date:
       englishPages.map(line => line.split(" ")).map(line => (line(0).substring(0, 8), line(2))).distinct().countByKey().foreach(println)
    Select distinct(datetime) from pagecounts order by datetime:
       englishPages.map(line => line.split(" ")).map(line => (line(0), 1)).distinct().sortByKey().collect().foreach(println)
40. Example & Demo. Dataset: Network Datasets (directed and bi-directed graphs):
    • One small Facebook social network: 127 nodes (friends), 1668 edges (friendships); bi-directed graph.
    • Google's internal site network: 15713 nodes (web pages), 170845 edges (hyperlinks); directed graph.
41. Example & Demo. Page Rank Calculation:
    • Estimates node importance.
    • Each directed link from A -> B is a vote for B from A.
    • The more links to a page, the more important the page is.
    • When a page with higher PR points to something, its vote weighs more.
    Algorithm:
    1. Start each page at a rank of 1.
    2. On each iteration, have page p contribute (rank of p) / (no. of neighbors of p) to its neighbors.
    3. Set each page's rank to 0.15 + 0.85 × contribs.
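
Step 3, written out as the standard damped PageRank update the slide describes (a reconstruction in LaTeX notation, not taken from the slides):

    r_{t+1}(p) = 0.15 + 0.85 \sum_{q \in \mathrm{in}(p)} \frac{r_t(q)}{|\mathrm{out}(q)|}

where in(p) is the set of pages linking to p and out(q) the set of pages q links to.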
42. Example & Demo. Scala Code:
       var iters = 100
       val lines = sc.textFile("/dataset/google/edges.csv", 1)
       val links = lines.map { s =>
         val parts = s.split("\t")
         (parts(0), parts(1))
       }.distinct().groupByKey().cache()
       var ranks = links.mapValues(v => 1.0)
       for (i <- 1 to iters) {
         val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
           val size = urls.size
           urls.map(url => (url, rank / size))
         }
         ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
       }
       val output = ranks.map(l => (l._2, l._1)).sortByKey(false).map(l => (l._2, l._1))
       output.take(20).foreach(tup => println(tup._2 + " : " + tup._1))
43. (Result: 2 seconds.)
44. (Result: 38 seconds.) Top 20 pages by rank:
       Page Rank      Page URL
       761.1985177    google
       455.7028756    google/about.html
       259.6052388    google/privacy.html
       192.7257649    google/jobs/
       144.0349154    google/support
       134.1566312    google/terms_of_service.html
       130.3546324    google/intl/en/about.html
       123.4014613    google/imghp
       120.0661165    google/accounts/Login
       118.6884515    google/intl/en/options/
       112.2309539    google/preferences
       108.8375347    google/sitemap.html
       106.9724799    google/press/
       105.822426     google/language_tools
       105.1554798    google/support/toolbar/
       99.97741309    google/maps
       97.90651416    google/advanced_search
       90.7910291     google/intl/en/services/
       90.70522689    google/intl/en/ads/
       87.4353413     google/adsense/
45. Agenda Of The Talk (repeated; section divider).
46. Spark Current Users & Roadmap. Source: Apache - Powered By Spark.
47. Roadmap.
48. Conclusion:
    • Because of in-memory processing, computations are very fast; developers can write iterative algorithms without writing out a result set after each pass through the data.
    • Suitable for scenarios where sufficient memory is available in your cluster.
    • It provides an integrated framework for advanced analytics like graph processing, stream processing, machine learning etc., which simplifies integration.
    • Its community is expanding and development is happening very aggressively.
    • It's comparatively newer than Hadoop, with only a few users so far.
49. Thank You. Speaker: MANISH GUPTA. Email: manish.gupta@globallogic.com. Organized by UNICOM Trainings & Seminars Pvt. Ltd., contact@unicomlearning.com.
50. Backup Slides.
51. Spark Internal Components. (Diagram: Spark core surrounded by Operators, Scheduler, Block manager, Networking, Accumulators, Interpreter, Broadcast, Hadoop I/O, Mesos backend, Standalone backend.)
52. In-Memory. But what if I run out of memory? (Chart: iteration time in seconds vs. % of working set in memory: cache disabled = 68.8 s; 25% = 58.1 s; 50% = 40.7 s; 75% = 29.7 s; fully cached = 11.5 s.)
53. Benchmarks.
    • AMPLab performed a quantitative and qualitative comparison of 4 systems: Hive, Impala, Redshift and Shark.
    • Done on the Common Crawl Corpus dataset: 81 TB, consisting of 3 tables (page rankings, user visits, documents).
    • Data was partitioned so that each node had 25 GB of user visits, 1 GB of rankings, and 30 GB of web crawl (documents).
    Source: https://amplab.cs.berkeley.edu/benchmark/#
54. Benchmarks.
55. Benchmarks: Hardware Configuration.
56. Benchmarks:
    • Redshift outperforms for on-disk data.
    • Shark and Impala outperform Hive by 3-4x.
    • For larger result sets, Shark outperforms Impala.
57. Benchmarks:
    • Redshift columnar storage outperforms every time.
    • Shark in-memory is 2nd best in all cases.
58. Benchmarks:
    • Redshift's bigger cluster has an advantage.
    • Shark and Impala are competitive with each other.
59. Benchmarks:
    • Impala & Redshift don't have UDFs.
    • Shark outperforms Hive.
60. Roadmap.
61. Spark: in the last 6 months of 2013.