INTERACTIVE ANALYTICS ON ‘BIG DATA’ USING
APACHE SPARK
DIPA DUBHASHI
HEAD - PRODUCT MANAGEMENT, SIGMOID
@ PUNE APACHE SPARK MEETUP
OCTOBER 8, 2015
APPROACH
USE-CASE → DATA SOURCES → TECH STACK → ARCHITECTURE
DECISION: TECHNOLOGY STACK
Proprietary (Outsource): Need urgently. Have money.
Open-source (Outsource): Willing to wait. Don't have time/expertise.
Open-source (Implement yourself): Willing to wait. Have time & expertise.
AD-TECH CASE STUDY
USE CASE: REAL-TIME AD CAMPAIGN PERFORMANCE TRACKING
DATA SOURCES: IMPRESSION, CLICK, CONVERSION LOGS
TECH STACK: ?
CRITERIA: REQUIREMENTS
RESPONSE TIME < 10s
INTERACTIVE UI
COST EFFECTIVE
DATA VOLUME – 100TB
• Good, consistent performance
• Targeted at business users
• Scalable, reliable architecture
• Demonstrable ROI
COMPONENTS
INGEST → TRANSFORM → STORE → PROCESS → VISUALIZE
SPARK + HDFS
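The five stages above can be sketched as composed functions. This is a toy plain-Python illustration with invented sample data, not the real stack: in the actual architecture Spark handles ingest/transform/process and HDFS handles the store stage.

```python
# Toy sketch of the five-stage pipeline: ingest -> transform -> store ->
# process -> visualize. All data here is hypothetical.
store = {}

def ingest():
    # stand-in for reading raw log lines (e.g. from Kafka or flat files)
    return ["1,click", "2,view", "3,click"]

def transform(lines):
    # parse raw lines into structured records
    return [line.split(",") for line in lines]

def persist(records):
    # stand-in for the "store" stage (HDFS in the real stack)
    store["events"] = records
    return records

def process(records):
    # stand-in for the "process" stage (Spark in the real stack)
    return sum(1 for _, kind in records if kind == "click")

def visualize(count):
    return "clicks: %d" % count

print(visualize(process(persist(transform(ingest())))))  # clicks: 2
```
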
INFRASTRUCTURE: STORAGE
MINIMIZE STORAGE FOOTPRINT
• Store less data (COMPRESSED)
• Store it in an appropriate format (COLUMNAR)
• Store it across multiple machines (DISTRIBUTED)
• Store it cheaply (COST EFFECTIVE)
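The compressed + columnar pairing above works because a columnar layout puts similar values next to each other, which generic codecs exploit. A minimal stdlib-only sketch (not Parquet itself; the log fields here are invented) comparing the same records compressed record-wise versus column-wise:

```python
# Sketch: why columnar layouts compress better. Hypothetical impression-log
# sample with repetitive dimension columns and one noisy metric column.
import zlib

rows = [
    ("site_%d" % (i % 5), "IN", "chrome" if i % 3 else "firefox",
     str(i * i % 9973))
    for i in range(10_000)
]

# Row-oriented layout: each record's fields stored together.
row_layout = "\n".join(",".join(r) for r in rows).encode()
# Column-oriented layout: all values of one column stored contiguously.
col_layout = "\n".join(",".join(c) for c in zip(*rows)).encode()

row_size = len(zlib.compress(row_layout))
col_size = len(zlib.compress(col_layout))
print(row_size, col_size)  # the columnar layout compresses smaller
```

Real Parquet goes further with per-column encodings (dictionary, run-length) before general-purpose compression, so the gap is usually larger than this sketch shows.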
FILE SYSTEM CHOICE: HDFS
SCALABLE, RELIABLE, COST-EFFECTIVE, FLEXIBLE
FILE FORMAT CHOICE: PARQUET
Efficient general-purpose columnar file format for Hadoop.
ROW GROUPS AND COLUMN CHUNKS
A row group has a fixed (configurable) byte size, so as the number of columns grows, each column's chunk within the row group gets smaller.
Row groups keep all columns of each record in the same HDFS block, so records can be reassembled from a single block.
Image Source - http://ingest.tips/2015/01/31/parquet-row-group-size/
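The row-group/column-chunk layout can be sketched in plain Python (this is an illustration of the idea, not real Parquet, and it splits by row count for simplicity where Parquet targets a byte size):

```python
# Sketch of Parquet's layout: rows split into row groups; within each row
# group, every column's values stored contiguously as a "column chunk".
ROW_GROUP_SIZE = 4  # rows per group here; real Parquet targets bytes (e.g. 128 MB)

records = [{"hour": h, "site": "s%d" % (h % 3), "m1": h * 10} for h in range(10)]

def to_row_groups(rows, size):
    groups = []
    for start in range(0, len(rows), size):
        chunk_rows = rows[start:start + size]
        # one column chunk (list of values) per column, stored together
        groups.append({col: [r[col] for r in chunk_rows] for col in rows[0]})
    return groups

groups = to_row_groups(records, ROW_GROUP_SIZE)

# Reassembling record 5 touches only its own row group, which is why
# Parquet tries to fit a whole row group inside one HDFS block.
g, i = divmod(5, ROW_GROUP_SIZE)
record5 = {col: groups[g][col][i] for col in groups[g]}
print(record5)  # {'hour': 5, 'site': 's2', 'm1': 50}
```
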
EFFICIENT READS
Image Source - http://www.slideshare.net/databricks/yin-huai-20150325meetupwithdemos
PROCESSING: BETTER PERFORMANCE
• Process less data, fast (SPEED)
• Support for required operators (OPERATOR SUPPORT)
• Distribute the computation (PARALLELIZE)
• Optimize the execution chain (OPTIMIZER)
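"Distribute the computation" is the map-reduce pattern Spark applies: each partition computes a partial aggregate locally on its executor, and only the small partials are merged. A plain-Python sketch with invented log data:

```python
# Sketch of distributed aggregation: per-partition partial sums merged at
# the end, so no single worker needs the full data set. Data is hypothetical.
from collections import Counter

# impression log: (hour, metric m1)
log = [("2014-04-01-%02d" % (i % 24), 1) for i in range(1000)]

# split into 4 "partitions" as a cluster scheduler would
partitions = [log[i::4] for i in range(4)]

def partial_sum(partition):
    # the "map" side: local aggregation within one partition
    acc = Counter()
    for hour, m1 in partition:
        acc[hour] += m1
    return acc

# the "reduce" side: merge the small per-partition results
merged = Counter()
for partial in map(partial_sum, partitions):
    merged.update(partial)

print(merged["2014-04-01-00"])
```
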
COMPUTATION ENGINE CHOICE: SPARK
EFFICIENT, INTEROPERABILITY, OPTIMIZER, USABILITY

EFFICIENT COMPUTATIONS
joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date > "2015-01-01")
Image Source - http://www.slideshare.net/databricks/introducing-dataframes-in-spark-for-large-scale-data-science
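The two DataFrame operations above compute a join on `users.id == events.uid` followed by a date filter. A plain-Python sketch of the same semantics on tiny invented data (real code would run on Spark DataFrames):

```python
# Sketch of what the DataFrame join + filter compute. Sample rows are
# hypothetical; lexicographic comparison works for ISO-format dates.
users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
events = [
    {"uid": 1, "date": "2015-03-01"},
    {"uid": 1, "date": "2014-12-31"},
    {"uid": 2, "date": "2015-02-15"},
]

# join on users.id == events.uid
joined = [{**u, **e} for u in users for e in events if u["id"] == e["uid"]]

# filter: events.date > "2015-01-01"
filtered = [r for r in joined if r["date"] > "2015-01-01"]
print(len(filtered))  # 2 rows survive the date filter
```

The point of the slide is that Spark's optimizer would not execute it this naively: Catalyst pushes the date filter below the join, so fewer rows are joined in the first place.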
BENCHMARK (1.5TB data)
Query description | Sigmoid | x1 | x2 | x3 | x4
select hour, sum(m1) from table where hour >= '2014-04-01-00' and hour < '2014-04-08-00' group by hour order by hour | 0.8s (1 core) | 18s | 1.2s | 1.4s | 2.1s
select hour, sum(m1), sum(m2), …, sum(m18) from table where hour >= '2014-04-01-00' and hour < '2014-04-08-00' group by hour order by hour | 0.8s (1 core) | 28s | 6.3s | 8.7s | 5-10s
select hour, sum(m1), …, sum(m18) from table where hour >= '2014-04-01-00' and hour < '2014-04-08-00' and site = 'x' group by hour | 0.9s (1 core) | 21s | 3.2s | 3.6s | 4.3s
select hour, sum(m1), …, sum(m18) from table where hour >= '2014-04-01-00' and hour < '2014-04-08-00' and account_id = 'a' and brand_id = 'b' group by hour | 2s (24 cores) | 38s | 2.8s | 0.9s | 0.6s
select account, sum(m1), …, sum(m18) from table where hour >= '2014-04-01-00' and hour < '2014-04-08-00' group by account | 1.6s (1 core) | 35s | 7.7s | 0.9s | 25s
select site, sum(m1), …, sum(m18) from table where hour >= '2014-04-01-00' and hour < '2014-04-08-00' group by site | 1.6s (2 cores) | 28s | 7.8s | - | 25s
select account, sum(m1) from table where hour >= '2014-04-01-00' and hour < '2014-04-08-00' group by account | 1s (1 core) | 17s | 2.4s | 1.2s | 10s
select account, sum(m1), …, sum(m18) from table where hour >= '2014-04-01-00' and hour < '2014-04-08-00' and site = 'x' group by account | 3.2s (2 cores) | 26s | 3.4s | 1.3s | 3.8s
select site, sum(m1), …, sum(m18) from table where hour >= '2014-04-01-00' and hour < '2014-04-08-00' and account = 'a' and brand_id = 'b' group by site | 3.2s (14 cores) | 38s | 12.7s | 1.6s | 4.3s
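The first benchmark query is a plain time-windowed group-by. To make its semantics concrete, here is the same query run with sqlite3 on tiny synthetic data (the actual benchmark ran on Spark over 1.5 TB; table name and columns follow the benchmark, the generated rows are invented):

```python
# Self-contained illustration of the first benchmark query's semantics,
# using sqlite3 on synthetic data: one row per (day, hour) for 14 days.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (hour TEXT, m1 INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("2014-04-%02d-%02d" % (d, h), 1) for d in range(1, 15) for h in range(24)],
)

# same shape as the benchmark query: hour-range filter, group, order
rows = conn.execute(
    "SELECT hour, SUM(m1) FROM t "
    "WHERE hour >= '2014-04-01-00' AND hour < '2014-04-08-00' "
    "GROUP BY hour ORDER BY hour"
).fetchall()
print(len(rows))  # 7 days x 24 hours = 168 groups
```

The string comparison on `hour` works because the `YYYY-MM-DD-HH` format sorts lexicographically in time order, which is also what makes the benchmark's range predicates cheap to push down.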
VISUALIZATION / DEMO
