94. 94
Processing Engines
• MapReduce
• Abstractions
• Spark
• Spark Streaming
• Impala
Confidentiality Information Goes Here
95. 95
MapReduce
• Oldie but goodie
• Restrictive Framework / Innovative Workarounds
• Extreme Batch
96. 96
MapReduce Basic High Level
[Diagram: HDFS (replicated) block of data → Mapper → temp spill data and partitioned, sorted data on the native file system → local copy at each Reducer → output file]
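The flow in this diagram can be sketched in plain Python. This is only an illustration of the map, partition/sort, and reduce steps using word count as the job; it is not Hadoop's API, and the function names (`map_phase`, `partition_and_sort`, `reduce_phase`, `word_count`) and the CRC-based partitioner are assumptions made for the example.

```python
import zlib
from collections import defaultdict

def map_phase(block):
    """Mapper: emit (word, 1) for every word in one block of input lines."""
    for line in block:
        for word in line.split():
            yield word, 1

def partition_and_sort(pairs, num_reducers):
    """Assign each key to one reducer, then sort every partition by key."""
    partitions = defaultdict(list)
    for key, value in pairs:
        # Stable hash so the same key always lands on the same reducer.
        reducer = zlib.crc32(key.encode("utf-8")) % num_reducers
        partitions[reducer].append((key, value))
    return {r: sorted(kv) for r, kv in partitions.items()}

def reduce_phase(sorted_pairs):
    """Reducer: sum the values seen for each key in its partition."""
    totals = {}
    for key, value in sorted_pairs:
        totals[key] = totals.get(key, 0) + value
    return totals

def word_count(blocks, num_reducers=2):
    """Run all mappers, shuffle, then merge the output of every reducer."""
    mapped = [pair for block in blocks for pair in map_phase(block)]
    output = {}
    for partition in partition_and_sort(mapped, num_reducers).values():
        output.update(reduce_phase(partition))
    return output
```

Calling `word_count([["a b a"], ["b c"]])` treats each inner list as one block handed to one mapper. Because a key is only ever routed to a single reducer, merging the reducer outputs cannot overwrite a count.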
97. 97
MapReduce Innovation
• Mapper Memory Joins
• Reducer Memory Joins
• Buckets Sorted Joins
• Cross Task Communication
• Windowing
• And Much More
98. 98
Abstractions
• SQL
– Hive
• Script/Code
– Pig: Pig Latin
– Crunch: Java/Scala
– Cascading: Java/Scala
99. 99
Spark
• The New Kid that isn't that New Anymore
• Easily 10x less code
• Extremely Easy and Powerful API
• Very good for machine learning
• Scala, Java, and Python
• RDDs
• DAG Engine
100. 100
Spark - DAG
101. 101
Spark - DAG
[Diagram: TextFile → Filter → KeyBy and TextFile → KeyBy feed a Join, followed by Filter → Take]
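A rough way to see why a DAG of this shape is cheap to run is to mimic it with lazy Python iterators. The sketch below is not the Spark API; `text_file`, `key_by`, and `join` are hypothetical stand-ins, and the input lines are invented for the example.

```python
from itertools import islice

def text_file(lines):
    """Stand-in for a TextFile source: lazily yields lines."""
    yield from lines

def key_by(records, key_fn):
    """Turn each record into a (key, record) pair, lazily."""
    return ((key_fn(r), r) for r in records)

def join(left, right):
    """Inner join of two (key, value) streams; the right side is held in memory."""
    lookup = {}
    for k, v in right:
        lookup.setdefault(k, []).append(v)
    for k, v in left:
        for w in lookup.get(k, []):
            yield k, (v, w)

# Hypothetical inputs: "user,page" lines and "user,country" lines.
clicks = text_file(["u1,/home", "u2,/cart", "u1,/cart", "bad"])
users = text_file(["u1,US", "u2,DE"])

first_field = lambda line: line.split(",")[0]
left = key_by(filter(lambda line: "," in line, clicks), first_field)
right = key_by(users, first_field)
joined = join(left, right)
cart_only = filter(lambda kv: kv[1][0].endswith("/cart"), joined)

# Nothing has been read so far; taking one record pulls only as much
# data through the graph as that single result needs.
result = list(islice(cart_only, 1))
```

Here `result` holds the first joined click on `/cart`. The third click line is never parsed, which is the kind of saving a planner gets from seeing the whole graph before running anything.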
102. 102
Spark - DAG
[Diagram: the same DAG (TextFile → Filter → KeyBy, TextFile → KeyBy, Join → Filter → Take) with each block marked Good, Good-Replay, Lost Block, Replay, or Future]
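The replay idea in this diagram can be illustrated with a small lineage sketch: every partition remembers the function and parent it was built from, so a lost block is recomputed on demand instead of being restored from a replica. The `Partition` class and its `runs` counter are hypothetical and exist only for this example; Spark's real mechanism is considerably more involved.

```python
class Partition:
    """A partition that remembers how to rebuild itself from its parent."""

    def __init__(self, compute, parent=None):
        self.compute = compute   # function: parent data -> this partition's data
        self.parent = parent     # upstream Partition, or None for a source
        self.cached = None       # materialized data, or None if lost / never run
        self.runs = 0            # how many times compute() was executed

    def get(self):
        if self.cached is None:
            parent_data = self.parent.get() if self.parent else None
            self.cached = self.compute(parent_data)
            self.runs += 1
        return self.cached

source = Partition(lambda _: [1, 2, 3, 4, 5, 6])
evens = Partition(lambda rows: [r for r in rows if r % 2 == 0], parent=source)
doubled = Partition(lambda rows: [r * 2 for r in rows], parent=evens)

first = doubled.get()

# Simulate losing an intermediate block and everything derived from it.
evens.cached = None
doubled.cached = None
second = doubled.get()   # replays evens and doubled, reuses the intact source
```

After the simulated loss, only the lost partitions are replayed: `source` is still good and runs once, while `evens` and `doubled` each run twice.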
103. 103
Spark Streaming
• Calling Spark in a Loop
• Extends RDDs with DStream
• Very Little Code Changes from ETL to Streaming
104. 104
Spark Streaming
[Diagram: in each batch interval (pre-first, first, second) a Source feeds a Receiver that produces RDDs; each completed batch is processed in a single pass: Filter → Count → Print]
105. 105
Spark Streaming
[Diagram: as on the previous slide, each batch (pre-first, first, second) is processed in a single pass with Filter → Count; the counts flow into Stateful RDD 1 and then Stateful RDD 2, each followed by Print]
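The micro-batch model on these two slides can be sketched as an ordinary loop that runs the same filter and count on every batch and carries a piece of state forward. This is a plain-Python illustration rather than the DStream API; `run_micro_batches`, the `ERROR` predicate, and the sample batches are assumptions made for the example.

```python
def run_micro_batches(batches, predicate):
    """Process each batch in one pass and carry a running total between batches."""
    running_total = 0        # plays the role of the stateful RDD
    report = []
    for batch in batches:    # "calling Spark in a loop"
        # Filter + Count over this batch only.
        count = sum(1 for record in batch if predicate(record))
        # New state is derived from the old state plus this batch.
        running_total += count
        report.append((count, running_total))
        print(f"batch count={count} running total={running_total}")
    return report

batches = [["ERROR a", "INFO b"], ["ERROR c", "ERROR d"], []]
report = run_micro_batches(batches, lambda r: r.startswith("ERROR"))
```

The per-batch count corresponds to the stateless Filter → Count → Print pass, while `running_total` shows how each stateful result is built from the previous one. An empty batch still produces a report entry, because the loop fires on the interval, not on the data.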
106. 106
Impala
• MPP Style SQL Engine on top of Hadoop
• Very Fast
• High Concurrency
107. 107
Impala – Broadcast Join
[Diagram: Impala Daemons read the data blocks of the smaller table, and every daemon ends up with 100% of the smaller table cached. Each daemon then runs a hash join function over its own data block of the bigger table against the cached smaller table and writes its output]
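As a minimal sketch of the broadcast strategy in this diagram, the function below gives every simulated daemon a full copy of the smaller table and lets it join only its own block of the bigger table. The name `broadcast_join` and the tuple layout are assumptions for illustration; this is not Impala code.

```python
def broadcast_join(big_blocks, small_table):
    """Inner join where every 'daemon' holds the whole smaller table.

    big_blocks:  list of blocks, each a list of (key, value) rows
    small_table: list of (key, value) rows, assumed to fit in memory
    """
    outputs = []
    for block in big_blocks:             # one iteration per daemon
        cached = dict(small_table)       # 100% of the smaller table, copied locally
        outputs.append(
            [(key, value, cached[key]) for key, value in block if key in cached]
        )
    return outputs

small = [(1, "bike"), (2, "helmet")]
big_blocks = [[(1, "order-a"), (3, "order-b")], [(2, "order-c"), (1, "order-d")]]
joined = broadcast_join(big_blocks, small)
```

The bigger table never moves between daemons, which is the appeal of this plan. The cost is that the smaller table is copied in full to each of them.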
108. 108
Impala – Broadcast Join
[Diagram: Impala Daemons read the data blocks of the smaller table and pass them through hash partitioners, leaving each of three daemons with ~33% of the smaller table cached. Data blocks of the bigger table pass through hash partitioners as well, and each daemon runs a hash join function against its 33% cached share of the smaller table and writes its output]
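The variant in this diagram, where each daemon caches only about a third of the smaller table, can be sketched by hash-partitioning both tables on the join key. The helper names `partition` and `partitioned_hash_join` are hypothetical, and integer keys with a modulo partitioner are used only to keep the example deterministic.

```python
def partition(rows, num_daemons):
    """Hash partitioner: route each (key, value) row to exactly one daemon."""
    parts = [[] for _ in range(num_daemons)]
    for key, value in rows:
        parts[key % num_daemons].append((key, value))
    return parts

def partitioned_hash_join(big_rows, small_rows, num_daemons=3):
    """Inner join where each daemon only sees its share of both tables."""
    small_parts = partition(small_rows, num_daemons)
    big_parts = partition(big_rows, num_daemons)
    outputs = []
    for small_part, big_part in zip(small_parts, big_parts):
        cached = dict(small_part)    # roughly 1/num_daemons of the smaller table
        outputs.append(
            [(key, value, cached[key]) for key, value in big_part if key in cached]
        )
    return outputs

small = [(1, "a"), (2, "b"), (3, "c")]
big = [(1, "x"), (2, "y"), (3, "z"), (4, "w"), (1, "q")]
joined = partitioned_hash_join(big, small)
```

Because both sides use the same partitioner, matching keys always meet on the same daemon. The trade-off against the broadcast plan is that rows of the bigger table now have to be shuffled as well.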
109. 109
Impala vs Hive
• Very different approaches and
• We may see convergence at some point
• But for now
– Impala for speed
– Hive for batch
113. 113
Why sessionize?
Helps answer questions like:
• What is my website's bounce rate?
– i.e. what percentage of visitors don't go past the landing page?
• Which marketing channels (e.g. organic search, display ad, etc.) are leading to the most sessions?
– Which of those lead to the most conversions (e.g. people buying things, signing up, etc.)?
• Do attribution analysis: which channels are responsible for the most conversions?
114. 114
Sessionization
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" 244.157.45.12+1413580110
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" 244.157.45.12+1413583199
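The value appended to each log line looks like the client IP joined by `+` to the Unix timestamp of the click that opens the session. A small sketch of deriving it is shown below, assuming the log timestamps are UTC (the lines carry no zone offset) and the default English month names; `session_id` and `LOG_PREFIX` are hypothetical names introduced for the example.

```python
import calendar
import re
from datetime import datetime

# IP, two ignored fields, then the bracketed timestamp (trailing space tolerated).
LOG_PREFIX = re.compile(r"^(\S+) \S+ \S+ \[([^\]]+?)\s*\]")

def session_id(log_line):
    """Build 'ip+epoch_seconds' from the IP and timestamp of one log line."""
    match = LOG_PREFIX.match(log_line)
    if match is None:
        raise ValueError("unrecognized log line")
    ip, stamp = match.groups()
    when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S")
    # Assumption: the timestamp is UTC, so timegm() is the right conversion.
    return f"{ip}+{calendar.timegm(when.timetuple())}"

line = '244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463'
print(session_id(line))  # 244.157.45.12+1413580110
```

This only works as a session ID when it is computed from the first click of a session; later clicks in the same session would reuse that first value rather than their own timestamp.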
115. 115
How to Sessionize?
1. Given a list of clicks, determine which clicks came from the same user (partitioning, ordering)
2. Given a particular user's clicks, determine if a given click is part of a new session or a continuation of the previous session (identifying session boundaries)
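The two steps above can be sketched in a few lines of Python. The sketch assumes that the IP address identifies a user and that a gap of more than 30 minutes between clicks starts a new session; both are assumptions for illustration, as are the names `sessionize` and `SESSION_TIMEOUT`.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60   # seconds of inactivity that end a session (assumed)

def sessionize(clicks, timeout=SESSION_TIMEOUT):
    """Map every (ip, epoch_seconds) click to a session ID 'ip+session_start'."""
    # Step 1: partition the clicks by user, then order each user's clicks by time.
    by_user = defaultdict(list)
    for ip, ts in clicks:
        by_user[ip].append(ts)

    sessions = {}
    for ip, stamps in by_user.items():
        stamps.sort()
        session_start = previous = None
        # Step 2: a click opens a new session if it is the user's first click
        # or if it arrives more than `timeout` seconds after the previous one.
        for ts in stamps:
            if previous is None or ts - previous > timeout:
                session_start = ts
            sessions[(ip, ts)] = f"{ip}+{session_start}"
            previous = ts
    return sessions

clicks = [
    ("244.157.45.12", 1413583199),
    ("244.157.45.12", 1413580110),
    ("244.157.45.12", 1413580200),
]
result = sessionize(clicks)
```

With these inputs the first two clicks in time order share the session `244.157.45.12+1413580110`, and the click at `1413583199` arrives 2999 seconds after the previous one, so it opens a new session.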
129. 129
Orchestrating Clickstream
• Data arrives through Flume
• Triggers a processing event:
– Sessionize
– Enrich: location, marketing channel…
– Store as Parquet
• Each day we process events from the previous day