SlideShare a Scribd company logo
1 of 48
Download to read offline
1 
ApplicaAon 
Architectures 
with 
Apache 
Headline 
Hadoop 
Goes 
Here 
Mark 
Grover 
| 
@mark_grover 
Speaker 
Name 
or 
Subhead 
Goes 
Here 
NYC 
HUG 
slideshare.com/markgrover 
October 
14th, 
2014 
DO 
NOT 
USE 
PUBLICLY 
PRIOR 
TO 
10/23/12 
©2014 Cloudera, Inc. All Rights 
Reserved.
About 
Me 
• CommiPer 
on 
Apache 
Bigtop, 
commiPer 
and 
PPMC 
member 
on 
Apache 
Sentry 
(incubaAng). 
• Contributor 
to 
Hadoop, 
Hive, 
Spark, 
Sqoop, 
Flume. 
• SoWware 
developer 
at 
Cloudera 
• @mark_grover 
2 ©2014 Cloudera, Inc. All Rights 
Reserved.
Co-­‐authoring 
O’Reilly 
book 
• @hadooparchbook 
• hadooparchitecturebook.com 
• Strata 
Hadoop 
World 
Tutorial 
• at 
9 
AM 
tomorrow 
©2014 Cloudera, Inc. All Rights 
3 Reserved.
4 
Case 
Study 
Click 
Stream 
Analysis 
©2014 Cloudera, Inc. All Rights 
Reserved.
AnalyAcs 
©2014 Cloudera, Inc. All Rights 
5 
Reserved.
AnalyAcs 
©2014 Cloudera, Inc. All Rights 
6 
Reserved.
AnalyAcs 
©2014 Cloudera, Inc. All Rights 
7 
Reserved.
AnalyAcs 
©2014 Cloudera, Inc. All Rights 
8 
Reserved.
AnalyAcs 
©2014 Cloudera, Inc. All Rights 
9 
Reserved.
AnalyAcs 
©2014 Cloudera, Inc. All Rights 
10 
Reserved.
AnalyAcs 
©2014 Cloudera, Inc. All Rights 
11 
Reserved.
Web 
Logs 
– 
Combined 
Log 
Format 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 
200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/ 
5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, 
like Gecko) Chrome/36.0.1944.0 Safari/537.36” 
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp? 
productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" 
"Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/ 
GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile 
Safari/533.1” 
©2014 Cloudera, Inc. All Rights 
Reserved. 
12
Clickstream 
AnalyAcs 
©2014 Cloudera, Inc. All Rights 
Reserved. 
244.157.45.12 - - [17/Oct/ 
2014:21:08:30 ] "GET /seatposts 
HTTP/1.0" 200 4463 "http:// 
bestcyclingreviews.com/ 
top_online_shops" "Mozilla/5.0 
(Macintosh; Intel Mac OS X 
10_9_2) AppleWebKit/537.36 
(KHTML, like Gecko) Chrome/ 
36.0.1944.0 Safari/537.36” 
13
Similar 
use-­‐cases 
• Sensors 
– 
heart, 
agriculture, 
etc. 
• Casinos 
– 
session 
of 
a 
person 
at 
a 
table 
©2014 Cloudera, Inc. All Rights 
14 Reserved.
Challenges 
of 
Hadoop 
ImplementaAon 
©2014 Cloudera, Inc. All Rights 
15 
Reserved.
Challenges 
of 
Hadoop 
ImplementaAon 
©2014 Cloudera, Inc. All Rights 
16 
Reserved.
Other 
challenges 
-­‐ 
Architectural 
ConsideraAons 
• Storage 
managers? 
• HDFS? 
HBase? 
• Data 
storage 
and 
modeling: 
• File 
formats? 
Compression? 
Schema 
design? 
• Data 
movement 
• How 
do 
we 
actually 
get 
the 
data 
into 
Hadoop? 
How 
do 
we 
get 
it 
out? 
• Metadata 
• How 
do 
we 
manage 
data 
about 
the 
data? 
• Processing 
• How 
can 
we 
transform 
it? 
How 
do 
we 
query 
it? 
• OrchestraAon 
• How 
do 
we 
manage 
the 
workflow 
for 
all 
of 
this? 
©2014 Cloudera, Inc. All Rights 
17 Reserved.
18 
2. 
Processing 
Since 
that’s 
all 
what 
the 
Ame 
allows 
todayJ 
©2014 Cloudera, Inc. All Rights 
Reserved.
Processing 
• De-­‐duplicaAon 
• Filtering 
• SessionizaAon 
19 ©2014 Cloudera, Inc. All Rights 
Reserved.
DeduplicaAon 
– 
remove 
duplicate 
records 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 
200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/ 
5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, 
like Gecko) Chrome/36.0.1944.0 Safari/537.36” 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 
200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/ 
5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, 
like Gecko) Chrome/36.0.1944.0 Safari/537.36” 
©2014 Cloudera, Inc. All Rights 
20 
Reserved.
Filtering 
– 
filter 
out 
invalid 
records 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 
200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/ 
5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, 
like Gecko) Chrome/36.0.1944.0 Safari/537.36” 
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp? 
productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" 
"Mozilla/5.0 (Linux; U… 
©2014 Cloudera, Inc. All Rights 
21 
Reserved.
SessionizaAon 
Website 
visit 
©2014 Cloudera, Inc. All Rights 
22 
Reserved. 
Visitor 
1 
Session 
1 
Visitor 
1 
Session 
2 
Visitor 
2 
Session 
1 
> 
30 
minutes
Why 
sessionize? 
Helps 
answers 
quesAons 
like: 
• What 
is 
my 
website’s 
bounce 
rate? 
• i.e. 
how 
many 
% 
of 
visitors 
don’t 
go 
past 
the 
landing 
page? 
• Which 
markeAng 
channels 
(e.g. 
organic 
search, 
display 
ad, 
etc.) 
are 
leading 
to 
most 
sessions? 
• Which 
ones 
of 
those 
lead 
to 
most 
conversions 
(e.g. 
people 
buying 
things, 
signing 
up, 
etc.) 
• Do 
aPribuAon 
analysis 
– 
which 
channels 
are 
responsible 
for 
most 
conversions? 
23 ©2014 Cloudera, Inc. All Rights 
Reserved.
SessionizaAon 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 
200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/ 
5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, 
like Gecko) Chrome/36.0.1944.0 Safari/537.36” 165 
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp? 
productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" 
"Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/ 
GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile 
Safari/533.1” 166 
©2014 Cloudera, Inc. All Rights 
24 
Reserved.
How 
to 
Sessionize? 
1. Given 
a 
list 
of 
clicks, 
determine 
which 
clicks 
came 
from 
the 
same 
user 
2. Given 
a 
parAcular 
user's 
clicks, 
determine 
if 
a 
given 
click 
is 
a 
part 
of 
a 
new 
session 
or 
a 
conAnuaAon 
of 
the 
previous 
session 
25 ©2014 Cloudera, Inc. All Rights 
Reserved.
#1 
– 
Which 
clicks 
are 
from 
same 
user? 
• We 
can 
use: 
• IP 
address 
(244.157.45.12) 
• Cookies 
(A9A3BECE0563982D) 
• IP 
address 
(244.157.45.12)and 
user 
agent 
string 
((KHTML, 
like Gecko) Chrome/36.0.1944.0 Safari/ 
537.36") 
26 ©2014 Cloudera, Inc. All Rights 
Reserved.
#1 
– 
Which 
clicks 
are 
from 
same 
user? 
• We 
can 
use: 
• IP 
address 
(244.157.45.12) 
• Cookies 
(A9A3BECE0563982D) 
• IP 
address 
(244.157.45.12)and 
user 
agent 
string 
((KHTML, 
like Gecko) Chrome/36.0.1944.0 Safari/ 
537.36") 
27 ©2014 Cloudera, Inc. All Rights 
Reserved.
#1 
– 
Which 
clicks 
are 
from 
same 
user? 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 
200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/ 
5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, 
like Gecko) Chrome/36.0.1944.0 Safari/537.36” 
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp? 
productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" 
"Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/ 
GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile 
Safari/533.1” 
©2014 Cloudera, Inc. All Rights 
28 
Reserved.
#2 
– 
Which 
clicks 
part 
of 
the 
same 
session? 
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 
200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/ 
5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, 
like Gecko) Chrome/36.0.1944.0 Safari/537.36” 
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp? 
productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" 
"Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/ 
GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile 
Safari/533.1” 
©2014 Cloudera, Inc. All Rights 
29 
Reserved. 
> 
30 
mins 
apart 
= 
different 
sessions
30 
SessionizaAon 
in 
MapReduce 
github.com/hadooparchitecturebook/clickstream-­‐tutorial 
©2014 Cloudera, Inc. All Rights 
Reserved.
SessionizaAon 
in 
MapReduce 
log 
IP2, 
log 
log 
lines 
31 ©2014 Cloudera, Inc. All Rights 
Reserved. 
Map 
Reduce 
Reduce 
Log 
line 
IP1, 
log 
lines 
IP1, 
lines 
Log 
line, 
session 
ID 
Map 
Map 
Log 
line 
Log 
line 
IP2, 
lines 
Log 
line, 
session 
ID
Mapper 
for 
SessionizaAon 
32 ©2014 Cloudera, Inc. All Rights 
Reserved. 
public 
static 
class 
SessionizeMapper 
extends 
Mapper<Object, 
Text, 
IpTimestampKey, 
Text> 
{ 
private 
Matcher 
logRecordMatcher; 
public 
void 
map(Object 
key, 
Text 
value, 
Context 
context 
) 
throws 
IOException, 
InterruptedException 
{ 
logRecordMatcher 
= 
logRecordPattern.matcher(value.toString()); 
// 
We 
only 
emit 
something 
out 
if 
the 
record 
matches 
with 
our 
regex. 
Otherwise, 
we 
assume 
the 
record 
is 
busted 
and 
simply 
ignore 
it 
if 
(logRecordMatcher.matches()) 
{ 
String 
ip 
= 
logRecordMatcher.group(1); 
DateTime 
timestamp 
= 
DateTime.parse(logRecordMatcher.group(2), 
TIMESTAMP_FORMATTER); 
Long 
unixTimestamp 
= 
timestamp.getMillis(); 
IpTimestampKey 
outputKey 
= 
new 
IpTimestampKey(ip, 
unixTimestamp); 
context.write(outputKey, 
value); 
} 
} 
}
Reducer 
for 
SessionizaAon 
public 
static 
class 
SessionizeReducer 
33 ©2014 Cloudera, Inc. All Rights 
Reserved. 
extends 
Reducer<IpTimestampKey, 
Text, 
IpTimestampKey, 
Text> 
{ 
private 
Text 
result 
= 
new 
Text(); 
public 
void 
reduce(IpTimestampKey 
key, 
Iterable<Text> 
values, 
Context 
context 
) 
throws 
IOException, 
InterruptedException 
{ 
// 
The 
sessionId 
generated 
here 
is 
per 
day, 
per 
IP. 
So, 
any 
queries 
// 
that 
will 
be 
done 
as 
if 
this 
session 
ID 
were 
global, 
would 
require 
// 
a 
combination 
of 
the 
day 
in 
question 
and 
IP 
as 
well. 
String 
sessionId 
= 
null; 
Long 
lastTimeStamp 
= 
null; 
for 
(Text 
value 
: 
values) 
{ 
String 
logRecord 
= 
value.toString();
Reducer 
for 
SessionizaAon 
// 
If 
this 
is 
the 
first 
record 
for 
this 
user 
or 
it's 
been 
more 
than 
the 
34 ©2014 Cloudera, Inc. All Rights 
Reserved. 
timeout 
since 
// 
the 
last 
click 
from 
this 
user, 
let's 
increment 
the 
session 
ID. 
if 
(lastTimeStamp 
== 
null 
|| 
(key.getUnixTimestamp() 
-­‐ 
lastTimeStamp 
> 
SESSION_TIMEOUT_IN_MS)) 
{ 
sessionId 
= 
key.getIp() 
+ 
"+" 
+ 
key.getUnixTimestamp(); 
} 
lastTimeStamp 
= 
key.getUnixTimestamp(); 
result.set(logRecord 
+ 
" 
" 
+ 
sessionId); 
// 
Since 
we 
only 
care 
about 
printing 
out 
the 
entire 
record 
in 
the 
result, 
with 
session 
ID 
appended 
// 
at 
the 
end, 
we 
just 
emit 
out 
"null" 
for 
the 
key 
context.write(null, 
result); 
} 
} 
}
Secondary 
sorAng 
– 
by 
Amestamp 
• Need 
records 
to 
reducer 
to 
be 
grouped 
by 
IP 
address 
and 
sorted 
by 
Amestamp 
– 
a 
concept 
called 
secondary 
sor/ng 
• Instead 
of 
using 
just 
IP 
address 
as 
map 
output 
key 
and 
reduce 
input 
key 
• We 
use 
a 
composite 
key 
(IP, 
Amestamp) 
as 
map 
output 
key 
and 
reduce 
input 
key 
35 ©2014 Cloudera, Inc. All Rights 
Reserved.
Secondary 
sorAng 
– 
vocabulary 
• Composite 
key 
– 
IP 
address, 
Amestamp 
• Natural 
key 
– 
IP 
address 
• Secondary 
sort 
key 
-­‐ 
Amestamp 
36 ©2014 Cloudera, Inc. All Rights 
Reserved.
Secondary 
sorAng 
• Custom 
Grouping 
Comparator 
– 
on 
Natural 
Key 
(IP) 
• Custom 
Sort 
Comparator 
– 
on 
Composite 
Key 
(IP, 
address) 
• Custom 
ParAAoner 
– 
on 
Natural 
Key 
(IP) 
job.setGroupingComparatorClass(NaturalKeyComparator.class 
); 
job.setSortComparatorClass(CompositeKeyComparator.class); 
job.setPartitionerClass(NaturalKeyPartitioner.class); 
37 ©2014 Cloudera, Inc. All Rights 
Reserved.
38 
Final 
Architecture 
©2014 Cloudera, Inc. All Rights 
Reserved.
Hadoop 
Cluster 
©2014 Cloudera, Inc. All Rights 
39 
Reserved. 
BI/VisualizaAon 
tool 
(e.g. 
microstrategy) 
BI 
Analysts 
Spark 
For 
machine 
learning 
and 
graph 
processing 
R/Python 
StaAsAcal 
Analysis 
Custom 
Apps 
3. 
Accessing 
2. 
Processing 
4. 
OrchestraAon 
Website 
users 
CRM 
System 
1. 
IngesAon 
via 
Oozie 
OperaAonal 
Data 
Store 
Via 
Sqoop 
Web 
servers
Final 
Architecture 
– 
High 
Level 
Overview 
40 
Data 
Sources 
IngesAon 
Data 
Storage/ 
Processing 
Data 
ReporAng/ 
Analysis 
©2014 Cloudera, Inc. All Rights 
Reserved.
Final 
Architecture 
– 
High 
Level 
Overview 
41 
Data 
Sources 
IngesAon 
Data 
Storage/ 
Processing 
Data 
ReporAng/ 
Analysis 
©2014 Cloudera, Inc. All Rights 
Reserved.
Final 
Architecture 
– 
IngesAon 
42 
Web 
App 
Avro 
Agent 
Web 
App 
Avro 
Agent 
Web 
App 
Avro 
Agent 
Web 
App 
Avro 
Agent 
Web 
App 
Avro 
Agent 
Web 
App 
Avro 
Agent 
Web 
App 
Avro 
Agent 
Web 
App 
Avro 
Agent 
Flume 
Agent 
Flume 
Agent 
Flume 
Agent 
Flume 
Agent 
Fan-­‐in 
PaPern 
MulA 
Agents 
for 
Failover 
and 
rolling 
restarts 
HDFS 
©2014 Cloudera, Inc. All Rights 
Reserved.
Final 
Architecture 
– 
High 
Level 
Overview 
43 
Data 
Sources 
IngesAon 
Data 
Storage/ 
Processing 
Data 
ReporAng/ 
Analysis 
©2014 Cloudera, Inc. All Rights 
Reserved.
Final 
Architecture 
– 
Storage 
and 
Processing 
/etl/weblogs/20140331/ 
/etl/weblogs/20140401/ 
… 
44 
Data 
Processing 
/data/markeAng/clickstream/bouncerate/ 
/data/markeAng/clickstream/aPribuAon/ 
… 
©2014 Cloudera, Inc. All Rights 
Reserved.
Final 
Architecture 
– 
High 
Level 
Overview 
45 
Data 
Sources 
IngesAon 
Data 
Storage/ 
Processing 
Data 
ReporAng/ 
Analysis 
©2014 Cloudera, Inc. All Rights 
Reserved.
Final 
Architecture 
– 
Data 
Access 
46 
Hive/ 
Impala 
BI/ 
AnalyAcs 
Tools 
JDBC/ODBC 
DWH 
Sqoop 
Local 
Disk 
R, 
etc. 
DB 
import 
tool 
©2014 Cloudera, Inc. All Rights 
Reserved.
Contact 
info 
• Mark 
Grover 
• @mark_grover 
• www.linkedin.com/in/grovermark 
• Slides 
at 
slideshare.net/markgrover 
47 ©2014 Cloudera, Inc. All Rights 
Reserved.
48 ©2014 Cloudera, Inc. All Rights 
Reserved.

More Related Content

What's hot

De-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-CloudDe-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-CloudDataWorks Summit
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impalamarkgrover
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera, Inc.
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)Todd Lipcon
 
Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014hadooparchbook
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv larsgeorge
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupMike Percy
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...DataWorks Summit/Hadoop Summit
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Jonathan Seidman
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Data Con LA
 
HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014larsgeorge
 
Application architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MRApplication architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MRmarkgrover
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopCloudera, Inc.
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...DataWorks Summit/Hadoop Summit
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Hadoop / Spark Conference Japan
 
Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013alanfgates
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopCloudera, Inc.
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detectionhadooparchbook
 

What's hot (20)

De-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-CloudDe-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-Cloud
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for Hadoop
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
 
Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014Application Architectures with Hadoop - Big Data TechCon SF 2014
Application Architectures with Hadoop - Big Data TechCon SF 2014
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 
HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014HBase Status Report - Hadoop Summit Europe 2014
HBase Status Report - Hadoop Summit Europe 2014
 
Application architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MRApplication architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MR
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in Hadoop
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
 
Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for Hadoop
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
 

Viewers also liked

Oozie or Easy: Managing Hadoop Workloads the EASY Way
Oozie or Easy: Managing Hadoop Workloads the EASY WayOozie or Easy: Managing Hadoop Workloads the EASY Way
Oozie or Easy: Managing Hadoop Workloads the EASY WayDataWorks Summit
 
Introduction to HBase - NoSqlNow2015
Introduction to HBase - NoSqlNow2015Introduction to HBase - NoSqlNow2015
Introduction to HBase - NoSqlNow2015Apekshit Sharma
 
Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorialawesomesos
 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoopmarkgrover
 
Introduction to Hadoop Developer Training Webinar
Introduction to Hadoop Developer Training WebinarIntroduction to Hadoop Developer Training Webinar
Introduction to Hadoop Developer Training WebinarCloudera, Inc.
 
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJIntro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJDaniel Madrigal
 
Spark徹底入門 #cwt2015
Spark徹底入門 #cwt2015Spark徹底入門 #cwt2015
Spark徹底入門 #cwt2015Cloudera Japan
 
Hadoop Crash Course Hadoop Summit SJ
Hadoop Crash Course Hadoop Summit SJ Hadoop Crash Course Hadoop Summit SJ
Hadoop Crash Course Hadoop Summit SJ Daniel Madrigal
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialDaniel Abadi
 
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, ClouderaHadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, ClouderaCloudera, Inc.
 
Introduction to Data Analyst Training
Introduction to Data Analyst TrainingIntroduction to Data Analyst Training
Introduction to Data Analyst TrainingCloudera, Inc.
 
Introduction to Apache HBase Training
Introduction to Apache HBase TrainingIntroduction to Apache HBase Training
Introduction to Apache HBase TrainingCloudera, Inc.
 
Hadoop Workshop using Cloudera on Amazon EC2
Hadoop Workshop using Cloudera on Amazon EC2Hadoop Workshop using Cloudera on Amazon EC2
Hadoop Workshop using Cloudera on Amazon EC2IMC Institute
 
Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Kathleen Ting
 
Real Time and Big Data – It’s About Time
Real Time and Big Data – It’s About TimeReal Time and Big Data – It’s About Time
Real Time and Big Data – It’s About TimeDataWorks Summit
 
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceCloudera, Inc.
 
Introduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperIntroduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperRahul Jain
 

Viewers also liked (20)

Oozie or Easy: Managing Hadoop Workloads the EASY Way
Oozie or Easy: Managing Hadoop Workloads the EASY WayOozie or Easy: Managing Hadoop Workloads the EASY Way
Oozie or Easy: Managing Hadoop Workloads the EASY Way
 
Introduction to HBase - NoSqlNow2015
Introduction to HBase - NoSqlNow2015Introduction to HBase - NoSqlNow2015
Introduction to HBase - NoSqlNow2015
 
Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorial
 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoop
 
Introduction to Hadoop Developer Training Webinar
Introduction to Hadoop Developer Training WebinarIntroduction to Hadoop Developer Training Webinar
Introduction to Hadoop Developer Training Webinar
 
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJIntro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
 
Spark徹底入門 #cwt2015
Spark徹底入門 #cwt2015Spark徹底入門 #cwt2015
Spark徹底入門 #cwt2015
 
Apache Hive - Introduction
Apache Hive - IntroductionApache Hive - Introduction
Apache Hive - Introduction
 
Hadoop Crash Course Hadoop Summit SJ
Hadoop Crash Course Hadoop Summit SJ Hadoop Crash Course Hadoop Summit SJ
Hadoop Crash Course Hadoop Summit SJ
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, ClouderaHadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
 
Inside Flume
Inside FlumeInside Flume
Inside Flume
 
Introduction to Data Analyst Training
Introduction to Data Analyst TrainingIntroduction to Data Analyst Training
Introduction to Data Analyst Training
 
Introduction to Apache HBase Training
Introduction to Apache HBase TrainingIntroduction to Apache HBase Training
Introduction to Apache HBase Training
 
Hadoop Workshop using Cloudera on Amazon EC2
Hadoop Workshop using Cloudera on Amazon EC2Hadoop Workshop using Cloudera on Amazon EC2
Hadoop Workshop using Cloudera on Amazon EC2
 
Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)
 
Real Time and Big Data – It’s About Time
Real Time and Big Data – It’s About TimeReal Time and Big Data – It’s About Time
Real Time and Big Data – It’s About Time
 
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
 
Hadoop Operations
Hadoop OperationsHadoop Operations
Hadoop Operations
 
Introduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperIntroduction to Kafka and Zookeeper
Introduction to Kafka and Zookeeper
 

Similar to NYC HUG - Application Architectures with Apache Hadoop

Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applicationshadooparchbook
 
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialStrata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialhadooparchbook
 
Hadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata LondonHadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata Londonhadooparchbook
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationshadooparchbook
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 
Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Cloudera, Inc.
 
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an exampleArchitecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an examplehadooparchbook
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015hadooparchbook
 
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPSimpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPDaniel Zivkovic
 
Application Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User GroupApplication Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User Grouphadooparchbook
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoophadooparchbook
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Cloudera, Inc.
 
Oracle_Patching_Untold_Story_Final_Part2.pdf
Oracle_Patching_Untold_Story_Final_Part2.pdfOracle_Patching_Untold_Story_Final_Part2.pdf
Oracle_Patching_Untold_Story_Final_Part2.pdfAlex446314
 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platformhadooparchbook
 
Keynote - Cloudy Vision: How Cloud Integration Complicates Security
Keynote - Cloudy Vision: How Cloud Integration Complicates SecurityKeynote - Cloudy Vision: How Cloud Integration Complicates Security
Keynote - Cloudy Vision: How Cloud Integration Complicates SecurityCloudVillage
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Data Con LA
 
Apache Accumulo Overview
Apache Accumulo OverviewApache Accumulo Overview
Apache Accumulo OverviewBill Havanki
 

Similar to NYC HUG - Application Architectures with Apache Hadoop (20)

Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applications
 
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialStrata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
 
Hadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata LondonHadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata London
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applications
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015
 
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an exampleArchitecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an example
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015
 
Video Analysis in Hadoop
Video Analysis in HadoopVideo Analysis in Hadoop
Video Analysis in Hadoop
 
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPSimpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
 
Application Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User GroupApplication Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User Group
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Oracle_Patching_Untold_Story_Final_Part2.pdf
Oracle_Patching_Untold_Story_Final_Part2.pdfOracle_Patching_Untold_Story_Final_Part2.pdf
Oracle_Patching_Untold_Story_Final_Part2.pdf
 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platform
 
Azure Service Fabric Mesh
Azure Service Fabric MeshAzure Service Fabric Mesh
Azure Service Fabric Mesh
 
Keynote - Cloudy Vision: How Cloud Integration Complicates Security
Keynote - Cloudy Vision: How Cloud Integration Complicates SecurityKeynote - Cloudy Vision: How Cloud Integration Complicates Security
Keynote - Cloudy Vision: How Cloud Integration Complicates Security
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 
Apache Accumulo Overview
Apache Accumulo OverviewApache Accumulo Overview
Apache Accumulo Overview
 

More from markgrover

From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting datamarkgrover
 
Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 markgrover
 
Amundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationAmundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationmarkgrover
 
REA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and AmundsenREA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and Amundsenmarkgrover
 
Amundsen gremlin proxy design
Amundsen gremlin proxy designAmundsen gremlin proxy design
Amundsen gremlin proxy designmarkgrover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadatamarkgrover
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futuremarkgrover
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discoverymarkgrover
 
TensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache BeamTensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache Beammarkgrover
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speedmarkgrover
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyftmarkgrover
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyftmarkgrover
 
Fighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache SpotFighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache Spotmarkgrover
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
Introduction to Hive and HCatalog
Introduction to Hive and HCatalogIntroduction to Hive and HCatalog
Introduction to Hive and HCatalogmarkgrover
 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprisesmarkgrover
 

More from markgrover (20)

From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting data
 
Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020
 
Amundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationAmundsen at Brex and Looker integration
Amundsen at Brex and Looker integration
 
REA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and AmundsenREA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and Amundsen
 
Amundsen gremlin proxy design
Amundsen gremlin proxy designAmundsen gremlin proxy design
Amundsen gremlin proxy design
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadata
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 
TensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache BeamTensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache Beam
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speed
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyft
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyft
 
Fighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache SpotFighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache Spot
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Introduction to Hive and HCatalog
Introduction to Hive and HCatalogIntroduction to Hive and HCatalog
Introduction to Hive and HCatalog
 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprises
 

Recently uploaded

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 

NYC HUG - Application Architectures with Apache Hadoop

  • 1. 1 ApplicaAon Architectures with Apache Headline Hadoop Goes Here Mark Grover | @mark_grover Speaker Name or Subhead Goes Here NYC HUG slideshare.com/markgrover October 14th, 2014 DO NOT USE PUBLICLY PRIOR TO 10/23/12 ©2014 Cloudera, Inc. All Rights Reserved.
  • 2. About Me • CommiPer on Apache Bigtop, commiPer and PPMC member on Apache Sentry (incubaAng). • Contributor to Hadoop, Hive, Spark, Sqoop, Flume. • SoWware developer at Cloudera • @mark_grover 2 ©2014 Cloudera, Inc. All Rights Reserved.
  • 3. Co-­‐authoring O’Reilly book • @hadooparchbook • hadooparchitecturebook.com • Strata Hadoop World Tutorial • at 9 AM tomorrow ©2014 Cloudera, Inc. All Rights 3 Reserved.
  • 4. 4 Case Study Click Stream Analysis ©2014 Cloudera, Inc. All Rights Reserved.
  • 5. AnalyAcs ©2014 Cloudera, Inc. All Rights 5 Reserved.
  • 6. AnalyAcs ©2014 Cloudera, Inc. All Rights 6 Reserved.
  • 7. AnalyAcs ©2014 Cloudera, Inc. All Rights 7 Reserved.
  • 8. AnalyAcs ©2014 Cloudera, Inc. All Rights 8 Reserved.
  • 9. AnalyAcs ©2014 Cloudera, Inc. All Rights 9 Reserved.
  • 10. AnalyAcs ©2014 Cloudera, Inc. All Rights 10 Reserved.
  • 11. AnalyAcs ©2014 Cloudera, Inc. All Rights 11 Reserved.
  • 12. Web Logs – Combined Log Format 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/ 5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp? productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/ GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” ©2014 Cloudera, Inc. All Rights Reserved. 12
  • 13. Clickstream AnalyAcs ©2014 Cloudera, Inc. All Rights Reserved. 244.157.45.12 - - [17/Oct/ 2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/ top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ 36.0.1944.0 Safari/537.36” 13
  • 14. Similar use-­‐cases • Sensors – heart, agriculture, etc. • Casinos – session of a person at a table ©2014 Cloudera, Inc. All Rights 14 Reserved.
  • 15. Challenges of Hadoop ImplementaAon ©2014 Cloudera, Inc. All Rights 15 Reserved.
  • 16. Challenges of Hadoop ImplementaAon ©2014 Cloudera, Inc. All Rights 16 Reserved.
  • 17. Other challenges -­‐ Architectural ConsideraAons • Storage managers? • HDFS? HBase? • Data storage and modeling: • File formats? Compression? Schema design? • Data movement • How do we actually get the data into Hadoop? How do we get it out? • Metadata • How do we manage data about the data? • Processing • How can we transform it? How do we query it? • OrchestraAon • How do we manage the workflow for all of this? ©2014 Cloudera, Inc. All Rights 17 Reserved.
  • 18. 18 2. Processing Since that’s all what the Ame allows todayJ ©2014 Cloudera, Inc. All Rights Reserved.
  • 19. Processing • De-­‐duplicaAon • Filtering • SessionizaAon 19 ©2014 Cloudera, Inc. All Rights Reserved.
  • 20. DeduplicaAon – remove duplicate records 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/ 5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/ 5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” ©2014 Cloudera, Inc. All Rights 20 Reserved.
  • 21. Filtering – filter out invalid records 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/ 5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp? productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U… ©2014 Cloudera, Inc. All Rights 21 Reserved.
  • 22. SessionizaAon Website visit ©2014 Cloudera, Inc. All Rights 22 Reserved. Visitor 1 Session 1 Visitor 1 Session 2 Visitor 2 Session 1 > 30 minutes
  • 23. Why sessionize? Helps answers quesAons like: • What is my website’s bounce rate? • i.e. how many % of visitors don’t go past the landing page? • Which markeAng channels (e.g. organic search, display ad, etc.) are leading to most sessions? • Which ones of those lead to most conversions (e.g. people buying things, signing up, etc.) • Do aPribuAon analysis – which channels are responsible for most conversions? 23 ©2014 Cloudera, Inc. All Rights Reserved.
  • 24. SessionizaAon 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/ 5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 165 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp? productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/ GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” 166 ©2014 Cloudera, Inc. All Rights 24 Reserved.
  • 25. How to Sessionize? 1. Given a list of clicks, determine which clicks came from the same user 2. Given a parAcular user's clicks, determine if a given click is a part of a new session or a conAnuaAon of the previous session 25 ©2014 Cloudera, Inc. All Rights Reserved.
  • 26. #1 – Which clicks are from same user? • We can use: • IP address (244.157.45.12) • Cookies (A9A3BECE0563982D) • IP address (244.157.45.12)and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/ 537.36") 26 ©2014 Cloudera, Inc. All Rights Reserved.
  • 27. #1 – Which clicks are from same user? • We can use: • IP address (244.157.45.12) • Cookies (A9A3BECE0563982D) • IP address (244.157.45.12)and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/ 537.36") 27 ©2014 Cloudera, Inc. All Rights Reserved.
  • 28. #1 – Which clicks are from same user? 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/ 5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp? productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/ GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” ©2014 Cloudera, Inc. All Rights 28 Reserved.
  • 29. #2 – Which clicks part of the same session? 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/ 5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp? productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/ GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” ©2014 Cloudera, Inc. All Rights 29 Reserved. > 30 mins apart = different sessions
  • 30. 30 SessionizaAon in MapReduce github.com/hadooparchitecturebook/clickstream-­‐tutorial ©2014 Cloudera, Inc. All Rights Reserved.
  • 31. SessionizaAon in MapReduce log IP2, log log lines 31 ©2014 Cloudera, Inc. All Rights Reserved. Map Reduce Reduce Log line IP1, log lines IP1, lines Log line, session ID Map Map Log line Log line IP2, lines Log line, session ID
  • 32. Mapper for SessionizaAon 32 ©2014 Cloudera, Inc. All Rights Reserved. public static class SessionizeMapper extends Mapper<Object, Text, IpTimestampKey, Text> { private Matcher logRecordMatcher; public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { logRecordMatcher = logRecordPattern.matcher(value.toString()); // We only emit something out if the record matches with our regex. Otherwise, we assume the record is busted and simply ignore it if (logRecordMatcher.matches()) { String ip = logRecordMatcher.group(1); DateTime timestamp = DateTime.parse(logRecordMatcher.group(2), TIMESTAMP_FORMATTER); Long unixTimestamp = timestamp.getMillis(); IpTimestampKey outputKey = new IpTimestampKey(ip, unixTimestamp); context.write(outputKey, value); } } }
  • 33. Reducer for SessionizaAon public static class SessionizeReducer 33 ©2014 Cloudera, Inc. All Rights Reserved. extends Reducer<IpTimestampKey, Text, IpTimestampKey, Text> { private Text result = new Text(); public void reduce(IpTimestampKey key, Iterable<Text> values, Context context ) throws IOException, InterruptedException { // The sessionId generated here is per day, per IP. So, any queries // that will be done as if this session ID were global, would require // a combination of the day in question and IP as well. String sessionId = null; Long lastTimeStamp = null; for (Text value : values) { String logRecord = value.toString();
  • 34. Reducer for SessionizaAon // If this is the first record for this user or it's been more than the 34 ©2014 Cloudera, Inc. All Rights Reserved. timeout since // the last click from this user, let's increment the session ID. if (lastTimeStamp == null || (key.getUnixTimestamp() -­‐ lastTimeStamp > SESSION_TIMEOUT_IN_MS)) { sessionId = key.getIp() + "+" + key.getUnixTimestamp(); } lastTimeStamp = key.getUnixTimestamp(); result.set(logRecord + " " + sessionId); // Since we only care about printing out the entire record in the result, with session ID appended // at the end, we just emit out "null" for the key context.write(null, result); } } }
  • 35. Secondary sorAng – by Amestamp • Need records to reducer to be grouped by IP address and sorted by Amestamp – a concept called secondary sor/ng • Instead of using just IP address as map output key and reduce input key • We use a composite key (IP, Amestamp) as map output key and reduce input key 35 ©2014 Cloudera, Inc. All Rights Reserved.
  • 36. Secondary sorAng – vocabulary • Composite key – IP address, Amestamp • Natural key – IP address • Secondary sort key -­‐ Amestamp 36 ©2014 Cloudera, Inc. All Rights Reserved.
  • 37. Secondary sorAng • Custom Grouping Comparator – on Natural Key (IP) • Custom Sort Comparator – on Composite Key (IP, address) • Custom ParAAoner – on Natural Key (IP) job.setGroupingComparatorClass(NaturalKeyComparator.class ); job.setSortComparatorClass(CompositeKeyComparator.class); job.setPartitionerClass(NaturalKeyPartitioner.class); 37 ©2014 Cloudera, Inc. All Rights Reserved.
  • 38. 38 Final Architecture ©2014 Cloudera, Inc. All Rights Reserved.
  • 39. Hadoop Cluster ©2014 Cloudera, Inc. All Rights 39 Reserved. BI/VisualizaAon tool (e.g. microstrategy) BI Analysts Spark For machine learning and graph processing R/Python StaAsAcal Analysis Custom Apps 3. Accessing 2. Processing 4. OrchestraAon Website users CRM System 1. IngesAon via Oozie OperaAonal Data Store Via Sqoop Web servers
  • 40. Final Architecture – High Level Overview 40 Data Sources IngesAon Data Storage/ Processing Data ReporAng/ Analysis ©2014 Cloudera, Inc. All Rights Reserved.
  • 41. Final Architecture – High Level Overview 41 Data Sources IngesAon Data Storage/ Processing Data ReporAng/ Analysis ©2014 Cloudera, Inc. All Rights Reserved.
  • 42. Final Architecture – IngesAon 42 Web App Avro Agent Web App Avro Agent Web App Avro Agent Web App Avro Agent Web App Avro Agent Web App Avro Agent Web App Avro Agent Web App Avro Agent Flume Agent Flume Agent Flume Agent Flume Agent Fan-­‐in PaPern MulA Agents for Failover and rolling restarts HDFS ©2014 Cloudera, Inc. All Rights Reserved.
  • 43. Final Architecture – High Level Overview 43 Data Sources IngesAon Data Storage/ Processing Data ReporAng/ Analysis ©2014 Cloudera, Inc. All Rights Reserved.
  • 44. Final Architecture – Storage and Processing /etl/weblogs/20140331/ /etl/weblogs/20140401/ … 44 Data Processing /data/markeAng/clickstream/bouncerate/ /data/markeAng/clickstream/aPribuAon/ … ©2014 Cloudera, Inc. All Rights Reserved.
  • 45. Final Architecture – High Level Overview 45 Data Sources IngesAon Data Storage/ Processing Data ReporAng/ Analysis ©2014 Cloudera, Inc. All Rights Reserved.
  • 46. Final Architecture – Data Access 46 Hive/ Impala BI/ AnalyAcs Tools JDBC/ODBC DWH Sqoop Local Disk R, etc. DB import tool ©2014 Cloudera, Inc. All Rights Reserved.
  • 47. Contact info • Mark Grover • @mark_grover • www.linkedin.com/in/grovermark • Slides at slideshare.net/markgrover 47 ©2014 Cloudera, Inc. All Rights Reserved.
  • 48. 48 ©2014 Cloudera, Inc. All Rights Reserved.