Application architectures with Hadoop and Sessionization in MR

Application Architectures with Apache Hadoop
Mark Grover | @mark_grover
East Bay JUG
hadooparchitecturebook.com
June 25th, 2014
©2014 Cloudera, Inc. All Rights
Reserved.
About Me
•  Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating).
•  Contributor to Hadoop, Hive, Spark, Sqoop, Flume.
•  Software developer at Cloudera
•  @mark_grover
Co-authoring O’Reilly book
•  @hadooparchbook
•  hadooparchitecturebook.com
What is Apache Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is…
•  Scalable
•  Fault tolerant
•  Distributed
CORE HADOOP SYSTEM COMPONENTS
•  Hadoop Distributed File System (HDFS): Self-Healing, High Bandwidth Clustered Storage
•  MapReduce: Distributed Computing Framework
Has the Flexibility to Store and Mine Any Type of Data
•  Ask questions across structured and unstructured data that were previously impossible to ask or solve
•  Not bound by a single schema
Excels at Processing Complex Data
•  Scale-out architecture divides workloads across multiple nodes
•  Flexible file system eliminates ETL bottlenecks
Scales Economically
•  Can be deployed on commodity hardware
•  Open source platform guards against vendor lock
Click Stream Analysis Case Study
Analytics
Web Logs
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0"
200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/
5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/36.0.1944.0 Safari/537.36”
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?
productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com"
"Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/
GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile
Safari/533.1”
Clickstream Analytics
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0"
200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/
5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/36.0.1944.0 Safari/537.36”
Click Stream Analysis (Before Hadoop)
[Diagram: Web logs, ODS, Data Warehouse, Business Intelligence; steps labeled Extract, Transform, Load, Transform, Query]
The Problems (Before Hadoop)
[Diagram: Web logs, ODS, Data Warehouse, Business Intelligence; steps labeled Extract, Transform, Load, Transform, Query; callouts 1 and 2 mark the problem areas]
1. Slow Data Transformations = Missed ETL SLAs.
2. Slow Queries = Frustrated Business Users.
Click Stream Analysis (After Hadoop)
[Diagram: the original pipeline with two steps crossed out (X), next to a simplified pipeline: Web logs, ODS, Transform, Query, Business Intelligence]
Questions raised on the diagram:
•  Flume or Sqoop?
•  Hive/Impala/MR?
•  Hive/Impala/Spark?
Challenges of Hadoop Implementation
Other challenges - Architectural Considerations
•  Storage managers?
    •  HDFS? HBase?
•  Data storage and modeling:
    •  File formats? Compression? Schema design?
•  Data movement
    •  How do we actually get the data into Hadoop? How do we get it out?
•  Metadata
    •  How do we manage data about the data?
•  Data access and processing
    •  How will the data be accessed once in Hadoop? How can we transform it? How do we query it?
•  Orchestration
    •  How do we manage the workflow for all of this?
High level Design
Clickstream Analytics
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0"
200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/
5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/36.0.1944.0 Safari/537.36”
[Architecture diagram: a Hadoop Cluster at the center, with four numbered stages]
1. Ingestion: Website users, Web servers; Operational Data Store and CRM System (via Sqoop)
2. Processing
3. Accessing: BI/Visualization tool (e.g. MicroStrategy) for BI Analysts; Spark for machine learning and graph processing; R/Python for statistical analysis; Custom Apps
4. Orchestration via Oozie
2. Processing
•  Since that’s of most interest to this audience
Processing
•  De-duplication
•  Filtering
•  Sessionization
Deduplication – remove duplicate records
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0"
200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/
5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/36.0.1944.0 Safari/537.36”
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0"
200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/
5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/36.0.1944.0 Safari/537.36”
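Sketch (not from the talk): a minimal single-machine illustration of de-duplication, assuming two records count as duplicates only when the raw log lines are identical. The class name `Dedup` and the shortened sample lines are hypothetical; on a cluster the same effect would normally come from a distributed job rather than an in-memory set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper for illustration only.
class Dedup {
    // Returns the input lines with exact duplicates removed, keeping first-seen order.
    static List<String> dedupe(List<String> logLines) {
        return new ArrayList<>(new LinkedHashSet<>(logLines));
    }

    public static void main(String[] args) {
        // Shortened sample records; the first two are identical.
        List<String> lines = Arrays.asList(
            "244.157.45.12 - - [17/Oct/2014:21:08:30 ] GET /seatposts 200 4463",
            "244.157.45.12 - - [17/Oct/2014:21:08:30 ] GET /seatposts 200 4463",
            "244.157.45.12 - - [17/Oct/2014:21:59:59 ] GET /Store/cart.jsp?productID=1023 200 3757");
        System.out.println(dedupe(lines).size()); // prints 2
    }
}
```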
Filtering – filter out invalid records
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0"
200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/
5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/36.0.1944.0 Safari/537.36”
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?
productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com"
"Mozilla/5.0 (Linux; U…
Sessionization
[Diagram: website visits grouped into sessions: Visitor 1 Session 1, Visitor 1 Session 2 (separated by a gap of more than 30 minutes), Visitor 2 Session 1]
Why sessionize?
Helps answer questions like:
•  What is my website’s bounce rate?
    •  i.e. what percentage of visitors don’t go past the landing page?
•  Which marketing channels (e.g. organic search, display ad, etc.) are leading to the most sessions?
    •  Which of those lead to the most conversions (e.g. people buying things, signing up, etc.)?
•  Do attribution analysis: which channels are responsible for the most conversions?
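Sketch (not from the talk): once clicks carry a session ID, bounce rate reduces to a simple ratio. This assumes a "bounce" means a session with exactly one page view and measures the rate per session; the class name `BounceRate` and the sample counts are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only.
class BounceRate {
    // clicksPerSession holds the number of page views in each session.
    // Returns the percentage of sessions that have exactly one page view.
    static double bounceRate(List<Integer> clicksPerSession) {
        if (clicksPerSession.isEmpty()) {
            return 0.0;
        }
        long bounces = clicksPerSession.stream().filter(c -> c == 1).count();
        return 100.0 * bounces / clicksPerSession.size();
    }

    public static void main(String[] args) {
        // Four sessions, two of which stopped at the landing page.
        System.out.println(bounceRate(Arrays.asList(1, 3, 1, 5))); // prints 50.0
    }
}
```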
Sessionization
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0"
200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/
5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/36.0.1944.0 Safari/537.36” 165
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?
productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com"
"Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/
GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile
Safari/533.1” 166
How to Sessionize?
1.  Given a list of clicks, determine which clicks came from the same user
2.  Given a particular user's clicks, determine if a given click is a part of a new session or a continuation of the previous session
#1 – Which clicks are from the same user?
•  We can use:
    •  IP address (244.157.45.12)
    •  Cookies (A9A3BECE0563982D)
    •  IP address (244.157.45.12) and user agent string ("… (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")
#1 – Which clicks are from the same user?
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0"
200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/
5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/36.0.1944.0 Safari/537.36”
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?
productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com"
"Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/
GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile
Safari/533.1”
#2 – Which clicks are part of the same session?
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0"
200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/
5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/36.0.1944.0 Safari/537.36”
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?
productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com"
"Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/
GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile
Safari/533.1”
More than 30 mins apart = different sessions
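Sketch (not from the talk): the 30-minute rule applied to one user's clicks in plain Java, assuming the click times are already sorted in ascending order. The class name `SessionSplitter` is hypothetical. The sample gap of 3,089,000 ms matches the 51 minutes 29 seconds between 21:08:30 and 21:59:59 in the two log records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only.
class SessionSplitter {
    static final long SESSION_TIMEOUT_IN_MS = 30 * 60 * 1000L;

    // timestampsMs: one user's click times in milliseconds, sorted ascending.
    // Returns a session number (1, 2, ...) for each click, in the same order.
    static List<Integer> assignSessions(List<Long> timestampsMs) {
        List<Integer> sessions = new ArrayList<>();
        int sessionId = 0;
        Long last = null;
        for (long ts : timestampsMs) {
            // First click, or a gap longer than the timeout, starts a new session.
            if (last == null || ts - last > SESSION_TIMEOUT_IN_MS) {
                sessionId++;
            }
            sessions.add(sessionId);
            last = ts;
        }
        return sessions;
    }

    public static void main(String[] args) {
        // Clicks at 0 min, 10 min, then 51 min 29 s after the second click.
        System.out.println(assignSessions(Arrays.asList(0L, 600_000L, 3_689_000L))); // prints [1, 1, 2]
    }
}
```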
  
Intro to MapReduce
MapReduce
•  Map – Apply a function to each input record
•  Shuffle & Sort – Partition the map output and sort each partition
•  Reduce – Apply aggregation function to all values in each partition
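Sketch (not from the talk): the three phases simulated in memory with plain Java collections, counting clicks per IP address. No Hadoop API is used here; the class `MiniMapReduce`, its method names, and the shortened sample lines (including the address 10.0.0.1) are hypothetical, and a real job would run the phases across many machines.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process imitation of map, shuffle & sort, and reduce.
class MiniMapReduce {
    // Map: turn one log line into a (key, value) pair; key = first field (the IP), value = 1.
    static Map.Entry<String, Integer> map(String logLine) {
        String ip = logLine.split(" ")[0];
        return new AbstractMap.SimpleEntry<>(ip, 1);
    }

    // Shuffle & sort: group values by key; the TreeMap keeps keys in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce: aggregate all values that share a key; here the aggregation is a sum.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> clicksPerIp(List<String> logLines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : logLines) {
            pairs.add(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(clicksPerIp(Arrays.asList(
            "244.157.45.12 GET /seatposts",
            "10.0.0.1 GET /index",
            "244.157.45.12 GET /Store/cart.jsp"))); // prints {10.0.0.1=1, 244.157.45.12=2}
    }
}
```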
  
Sessionization in MapReduce
github.com/hadooparchitecturebook/hadoop-arch-book/tree/master/ch06/MRSessionize
[Diagram: three Map tasks each read log lines; two Reduce tasks receive the log lines grouped by IP (IP1, IP2) and emit (log line, session ID)]
Mapper for Sessionization
    import java.io.IOException;
    import java.util.regex.Matcher;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.joda.time.DateTime;

    // logRecordPattern and TIMESTAMP_FORMATTER are constants of the enclosing class
    // that are not shown on the slide.
    public static class SessionizeMapper
            extends Mapper<Object, Text, IpTimestampKey, Text> {
        private Matcher logRecordMatcher;

        public void map(Object key, Text value, Context context
        ) throws IOException, InterruptedException {
            logRecordMatcher = logRecordPattern.matcher(value.toString());

            // We only emit something out if the record matches with our regex. Otherwise, we
            // assume the record is busted and simply ignore it
            if (logRecordMatcher.matches()) {
                String ip = logRecordMatcher.group(1);
                DateTime timestamp = DateTime.parse(logRecordMatcher.group(2),
                        TIMESTAMP_FORMATTER);
                Long unixTimestamp = timestamp.getMillis();
                // Composite key (IP, timestamp) so the framework can sort clicks by time
                IpTimestampKey outputKey = new IpTimestampKey(ip, unixTimestamp);
                context.write(outputKey, value);
            }
        }
    }
  
Reducer for Sessionization
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // SESSION_TIMEOUT_IN_MS is a constant of the enclosing class that is not shown on the slide.
    public static class SessionizeReducer
            extends Reducer<IpTimestampKey, Text, IpTimestampKey, Text> {
        private Text result = new Text();
        private static int sessionId = 0;
        private Long lastTimeStamp = null;

        public void reduce(IpTimestampKey key, Iterable<Text> values, Context context
        ) throws IOException, InterruptedException {
            // Each reduce() call covers one IP address (grouping is on the natural key),
            // so forget the previous user's last click before starting.
            lastTimeStamp = null;
            for (Text value : values) {
                String logRecord = value.toString();
                // If this is the first record for this user or it's been more than the timeout
                // since the last click from this user, let's increment the session ID.
                if (lastTimeStamp == null || (key.getUnixTimestamp() - lastTimeStamp >
                        SESSION_TIMEOUT_IN_MS)) {
                    sessionId++;
                }
                lastTimeStamp = key.getUnixTimestamp();
                result.set(logRecord + " " + sessionId);
                // Since we only care about printing out the entire record in the result, with
                // session ID appended at the end, we just emit out "null" for the key
                context.write(null, result);
            }
        }
    }
  
Secondary sorting – by timestamp
•  Records arriving at the reducer need to be grouped by IP address and sorted by timestamp, a concept called secondary sorting
•  Instead of using just the IP address as the map output key and reduce input key
    •  We use a composite key (IP, timestamp) as the map output key and reduce input key
Secondary sorting – vocabulary
•  Composite key – IP address, timestamp
•  Natural key – IP address
•  Secondary sort key – timestamp
Secondary sorting
•  Custom Grouping Comparator – on Natural Key (IP)
•  Custom Sort Comparator – on Composite Key (IP, timestamp)
•  Custom Partitioner – on Natural Key (IP)
    job.setGroupingComparatorClass(NaturalKeyComparator.class);
    job.setSortComparatorClass(CompositeKeyComparator.class);
    job.setPartitionerClass(NaturalKeyPartitioner.class);
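Sketch (not from the talk): what the three custom classes decide, shown with plain Java comparators instead of the Hadoop interfaces so it runs on its own. The names `SecondarySortSketch`, `IpTimestamp`, `COMPOSITE_ORDER`, `NATURAL_ORDER`, and `partition` are hypothetical stand-ins, not the classes in the book's repository.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-ins for the composite key, the two comparators, and the partitioner.
class SecondarySortSketch {
    // Composite key: natural key (ip) plus secondary sort key (timestamp).
    static final class IpTimestamp {
        final String ip;
        final long timestamp;

        IpTimestamp(String ip, long timestamp) {
            this.ip = ip;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return ip + "@" + timestamp;
        }
    }

    // Sort comparator: orders by ip, then by timestamp (the full composite key).
    static final Comparator<IpTimestamp> COMPOSITE_ORDER =
        Comparator.comparing((IpTimestamp k) -> k.ip).thenComparingLong(k -> k.timestamp);

    // Grouping comparator: compares ip only, so all keys for one ip land in one reduce call.
    static final Comparator<IpTimestamp> NATURAL_ORDER =
        Comparator.comparing((IpTimestamp k) -> k.ip);

    // Partitioner: hashes ip only, so every record for an ip goes to the same reducer.
    static int partition(IpTimestamp key, int numReducers) {
        return Math.floorMod(key.ip.hashCode(), numReducers);
    }

    static List<IpTimestamp> sorted(List<IpTimestamp> keys) {
        List<IpTimestamp> copy = new ArrayList<>(keys);
        copy.sort(COMPOSITE_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<IpTimestamp> keys = Arrays.asList(
            new IpTimestamp("b", 1L), new IpTimestamp("a", 9L), new IpTimestamp("a", 2L));
        System.out.println(sorted(keys)); // prints [a@2, a@9, b@1]
    }
}
```

If the partitioner hashed the whole composite key instead, clicks from one IP could be spread across reducers and no single reducer would see the full click history for that user.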
  	
  
Data Storage and Modeling
Data Storage – Storage Manager considerations
•  Popular storage managers for Hadoop
    •  Hadoop Distributed File System (HDFS)
    •  HBase
Data Storage – HDFS vs HBase
HDFS
•  Stores data directly as files
•  Fast scans
•  Poor random reads/writes
HBase
•  Stores data as HFiles on HDFS
•  Slow scans
•  Fast random reads/writes
Data Storage – Storage Manager considerations
•  We choose HDFS
    •  Analytical needs in this case served better by fast scans.
Data Storage Format
Data Storage – Format Considerations
•  Store as plain text?
    •  Sure, well supported by Hadoop.
    •  Text can easily be processed by MapReduce, loaded into Hive for analysis, and so on.
•  But…
    •  Will begin to consume lots of space in HDFS.
    •  May not be optimal for processing by tools in the Hadoop ecosystem.
Data Storage – Format Considerations
•  But, we can compress the text files…
    •  Gzip – supported by Hadoop, but not splittable.
    •  Bzip2 – hey, splittable! Great compression! But decompression is slooowww.
    •  LZO – splittable (with some work), good compress/de-compress performance. Good choice for storing text files on Hadoop.
    •  Snappy – provides a good tradeoff between size and speed.
Data Storage – More About Snappy
•  Designed at Google to provide high compression speeds with reasonable compression.
•  Not the highest compression, but provides very good performance for processing on Hadoop.
•  Snappy is not splittable though, which brings us to…
SequenceFile
•  Stores records as binary key/value pairs.
•  SequenceFile “blocks” can be compressed.
•  This enables splittability with non-splittable compression.
Avro
•  Kinda SequenceFile on Steroids.
•  Self-documenting – stores schema in header.
•  Provides very efficient storage.
•  Supports splittable compression.
Our Format Choices…
•  Avro with Snappy
    •  Snappy provides optimized compression.
    •  Avro provides compact storage, self-documenting files, and supports schema evolution.
    •  Avro also provides better failure handling than other choices.
•  SequenceFiles would also be a good choice, and are directly supported by ingestion tools in the ecosystem.
    •  But only supports Java.
HDFS Schema Design
Recommended HDFS Schema Design
•  How to lay out data on HDFS?
/user/<username> – User specific data, jars, conf files
/etl – Data in various stages of ETL workflow
/tmp – temp data from tools or shared between users
/data – shared data for the entire organization
/app – Everything but data: UDF jars, HQL files, Oozie workflows
Advanced HDFS Schema Design
What is Partitioning?
Un-partitioned HDFS directory structure
dataset
   file1.txt
   file2.txt
   .
   .
   .
   filen.txt
Partitioned HDFS directory structure
dataset
   col=val1/file.txt
   col=val2/file.txt
   .
   .
   .
   col=valn/file.txt
What is Partitioning?
Un-partitioned HDFS directory structure
clicks
   clicks-2014-01-01.txt
   clicks-2014-01-02.txt
   .
   .
   .
   clicks-2014-03-31.txt
Partitioned HDFS directory structure
clicks
   dt=2014-01-01/clicks.txt
   dt=2014-01-02/clicks.txt
   .
   .
   .
   dt=2014-03-31/clicks.txt
Partitioning
•  Split the dataset into smaller consumable chunks
•  Rudimentary form of “indexing”
•  <data set name>/<partition_column_name=partition_column_value>/{files}
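Sketch (not from the talk): building the partition directory name from the pattern above, assuming the `clicks` dataset is partitioned by a daily `dt` column as in the directory listings. The class `PartitionPath` and its methods are hypothetical.

```java
import java.time.LocalDate;

// Hypothetical helper for illustration only.
class PartitionPath {
    // Builds <data set name>/<partition_column_name=partition_column_value>/
    static String partitionDir(String dataset, String column, String value) {
        return dataset + "/" + column + "=" + value + "/";
    }

    // Daily partition of the clicks dataset; LocalDate.toString() yields yyyy-MM-dd.
    static String dailyClicksDir(LocalDate day) {
        return partitionDir("clicks", "dt", day.toString());
    }

    public static void main(String[] args) {
        System.out.println(dailyClicksDir(LocalDate.of(2014, 1, 1))); // prints clicks/dt=2014-01-01/
    }
}
```

A query restricted to one day then only needs to read the files under that one directory, which is the sense in which partitioning acts as a rudimentary index.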
Partitioning considerations
•  What column to partition by?
•  HDFS is append only.
•  Don’t have too many partitions (<10,000)
•  Don’t have too many small files in the partitions (files should generally be larger than the block size)
•  We decided to partition by timestamp
What is bucketing?
Un-bucketed HDFS directory structure
clicks
   dt=2014-01-01/clicks.txt
   dt=2014-01-02/clicks.txt
Bucketed HDFS directory structure
clicks
   dt=2014-01-01/file0.txt
   dt=2014-01-01/file1.txt
   dt=2014-01-01/file2.txt
   dt=2014-01-01/file3.txt
   dt=2014-01-02/file0.txt
   dt=2014-01-02/file1.txt
   dt=2014-01-02/file2.txt
   dt=2014-01-02/file3.txt
Bucketing
•  Hash-bucketed files within each partition based on a particular column
•  Useful when sampling
•  Useful in some joins; prerequisites:
    •  Datasets bucketed on the same key as the join key
    •  Number of buckets are the same or one is a multiple of the other
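Sketch (not from the talk): assigning a record to a bucket file by hashing one column, assuming the bucketing column is the cookie value. It uses Java's `String.hashCode()` purely for illustration; engines such as Hive use their own hash functions, and the class `Bucketing` is hypothetical.

```java
// Hypothetical helper for illustration only.
class Bucketing {
    // Maps a cookie value to one of numBuckets buckets; floorMod keeps the result non-negative.
    static int bucketFor(String cookie, int numBuckets) {
        return Math.floorMod(cookie.hashCode(), numBuckets);
    }

    // File name in the style of the bucketed directory listing (file0.txt, file1.txt, ...).
    static String bucketFile(String cookie, int numBuckets) {
        return "file" + bucketFor(cookie, numBuckets) + ".txt";
    }

    public static void main(String[] args) {
        String cookie = "A9A3BECE0563982D";
        System.out.println(bucketFile(cookie, 4));
        // With 8 buckets, the bucket number modulo 4 equals the 4-bucket number,
        // which is why bucket counts that are multiples of each other still line up in joins.
        System.out.println(bucketFor(cookie, 8) % 4 == bucketFor(cookie, 4)); // prints true
    }
}
```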
Bucketing considerations?
•  Which column to bucket on?
•  How many buckets?
•  We decided to bucket based on cookie
De-normalizing considerations
•  In general, big data joins are expensive
•  When to de-normalize?
•  We decided to pre-join (de-normalize) the smaller dimension tables
•  Big fact tables are still joined at query time
Data Ingestion
File Transfers
•  “hadoop fs -put <file>”
•  Reliable, but not resilient to failure.
•  Other options are mountable HDFS, for example NFSv3.
Streaming Ingestion
•  Flume
    •  Reliable, distributed, and available system for efficient collection, aggregation and movement of streaming data, e.g. logs.
•  Kafka
    •  Reliable and distributed publish-subscribe messaging system.
Flume vs. Kafka
Flume
•  Purpose built for Hadoop data ingest.
•  Pre-built sinks for HDFS, HBase, etc.
•  Supports transformation of data in-flight.
Kafka
•  General pub-sub messaging framework.
•  Hadoop not supported, requires 3rd-party component (Camus).
•  Just a message transport (a very fast one).
Flume vs. Kafka
•  Bottom line:
    •  Flume very well integrated with Hadoop ecosystem, well suited to ingestion of sources such as log files.
    •  Kafka is a highly reliable and scalable enterprise messaging system, and great for scaling out to multiple consumers.
A Quick Introduction to Flume
[Diagram: External Source, then a Flume Agent containing Source, Channel, and Sink, then Destination]
•  External Source: Web Server, Twitter, JMS, System logs, …
•  Source: consumes events and forwards to channels
•  Channel: stores events until consumed by sinks (file, memory, JDBC)
•  Sink: removes event from channel and puts into external destination
•  Flume Agent: JVM process hosting components
A Quick Introduction to Flume
•  Reliable – events are stored in channel until delivered to next stage.
•  Recoverable – events can be persisted to disk and recovered in the event of failure.
[Diagram: Flume Agent containing Source, Channel, and Sink, then Destination]
A Quick Introduction to Flume
•  Declarative
    •  No coding required.
    •  Configuration specifies how components are wired together.
A Brief Discussion of Flume Patterns – Fan-in
•  Flume agent runs on each of our servers.
•  These agents send data to multiple agents to provide reliability.
•  Flume provides support for load balancing.
A Brief Discussion of Flume Patterns – Splitting
•  Common need is to split data on ingest.
•  For example:
    •  Sending data to multiple clusters for DR.
    •  To multiple destinations.
•  Flume also supports partitioning, which is key to our implementation.
Sqoop Overview
•  Apache project designed to ease import and export of data between Hadoop and external data stores such as relational databases.
•  Great for doing bulk imports and exports of data between HDFS, Hive and HBase and an external data store. Not suited for ingesting event based data.
Ingestion Decisions
•  Historical Data
    •  Smaller files: file transfer
    •  Larger files: Flume with spooling directory source.
•  Incoming Data
    •  Flume with the spooling directory source.
Data Processing and Access
Data flow
[Diagram: Raw data; Partitioned clickstream data; Other data (Financial, CRM, etc.); Aggregated dataset #1; Aggregated dataset #2]
Data processing tools
•  Hive
•  Impala
•  Pig, etc.
Hive
•  Open source data warehouse system for Hadoop
•  Converts SQL-like queries to MapReduce jobs
    •  Work is being done to move this away from MR
•  Stores metadata in Hive metastore
•  Can create tables over HDFS or HBase data
•  Access available via JDBC/ODBC
Impala
•  Real-time open source SQL query engine for Hadoop
•  Doesn’t build on MapReduce
•  Written in C++, uses LLVM for run-time code generation
•  Can create tables over HDFS or HBase data
•  Accesses Hive metastore for metadata
•  Access available via JDBC/ODBC
Pig
•  Higher level abstraction over MapReduce (like Hive)
•  Write transformations in scripting language – Pig Latin
•  Can access Hive metastore via HCatalog for metadata
Data	
  Processing	
  consideraAons	
  
83	
  
•  We	
  chose	
  Hive	
  for	
  ETL	
  	
  and	
  Impala	
  for	
  interac/ve	
  BI.	
  
Metadata Management
  
What is Metadata?
•  Metadata is data about the data
    •  Format in which data is stored
    •  Compression codec
    •  Location of the data
    •  Is the data partitioned/bucketed/sorted?
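
The kinds of metadata listed above can be pictured as a small record kept separately from the data it describes. The sketch below is only an illustration: the class name, field names, and sample values (`avro`, `snappy`, the `dt` partition column) are assumptions for this example and do not reflect the Hive metastore's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TableMetadata:
    """Descriptive facts about a dataset, stored apart from the data itself."""
    name: str
    location: str            # where the data lives, e.g. an HDFS directory
    file_format: str         # format in which the data is stored
    compression_codec: str   # codec applied to the stored files
    partition_columns: List[str] = field(default_factory=list)
    bucket_columns: List[str] = field(default_factory=list)
    sort_columns: List[str] = field(default_factory=list)

    def is_partitioned(self) -> bool:
        """True when at least one partition column is recorded."""
        return bool(self.partition_columns)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    weblogs = TableMetadata(
        name="weblogs",
        location="/etl/weblogs",
        file_format="avro",
        compression_codec="snappy",
        partition_columns=["dt"],
    )
    print(weblogs.is_partitioned())
```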
  
Metadata in Hive
[Diagram: Hive Metastore]
  
Metadata
•  Hive metastore has become the de-facto metadata repository
•  HCatalog makes Hive metastore accessible to other applications (Pig, MapReduce, custom apps, etc.)
  
Hive + HCatalog
  
Orchestration
  
Orchestration
•  Once the data is in Hadoop, we need a way to manage workflows in our architecture.
    •  Scheduling and tracking MapReduce jobs, Hive jobs, etc.
•  Several options here:
    •  Cron
    •  Oozie, Azkaban
    •  3rd-party tools, Talend, Pentaho, Informatica, enterprise schedulers.
  
Oozie
•  Supports defining and executing a sequence of jobs.
•  Can trigger jobs based on external dependencies or schedules.
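
The core idea of executing a sequence of jobs in dependency order can be sketched in a few lines of Python. This is not Oozie's API or workflow format; the function name and the job names (`ingest`, `sessionize`, `aggregate`) are hypothetical, and the sketch leaves out scheduling, retries, and external triggers.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(actions, depends_on):
    """Run each action only after everything it depends on has finished.

    actions: dict mapping action name -> zero-argument callable
    depends_on: dict mapping action name -> set of prerequisite action names
    Returns the order in which the actions ran.
    """
    # TopologicalSorter expects a mapping of node -> predecessors.
    graph = {name: depends_on.get(name, set()) for name in actions}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        actions[name]()
    return order


if __name__ == "__main__":
    log = []
    # Hypothetical jobs, each standing in for a MapReduce or Hive job.
    actions = {
        "ingest": lambda: log.append("ingest"),
        "sessionize": lambda: log.append("sessionize"),
        "aggregate": lambda: log.append("aggregate"),
    }
    depends_on = {"sessionize": {"ingest"}, "aggregate": {"sessionize"}}
    print(run_workflow(actions, depends_on))
```

A workflow scheduler adds what this sketch omits: persistence of job state, failure handling, and triggers based on time or data availability.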
  
Final Architecture
  
Final Architecture – High Level Overview
[Diagram stages, in order: Data Sources; Ingestion; Data Storage/Processing; Data Reporting/Analysis]
  
Final Architecture – Ingestion
[Diagram: Web App / Avro Agent (×8); Flume Agent (×4); HDFS. Annotations: Fan-in Pattern; Multi Agents for Failover and rolling restarts]
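
The "multi agents for failover" annotation can be illustrated with a small client-side sketch: try each collector in turn and use the first one that accepts the event. Flume provides failover through its own configuration, so this is not how the architecture in the talk is implemented; the function name and the collector callables are hypothetical and only show the idea.

```python
def send_with_failover(event, collectors):
    """Try each collector in order; return the index of the one that accepted the event.

    collectors: list of callables that take an event and raise ConnectionError on failure.
    """
    last_error = None
    for index, send in enumerate(collectors):
        try:
            send(event)
            return index
        except ConnectionError as exc:
            # This collector is unavailable; fall through to the next one.
            last_error = exc
    raise RuntimeError("no collector accepted the event") from last_error


if __name__ == "__main__":
    received = []

    def collector_down(event):
        raise ConnectionError("collector is restarting")

    def collector_up(event):
        received.append(event)

    used = send_with_failover("click", [collector_down, collector_up])
    print(used, received)
```

With several collectors behind every web server, one collector can be restarted at a time without losing events, which is what makes rolling restarts possible.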
  	
  
Final Architecture – Storage and Processing
/etl/weblogs/20140331/
/etl/weblogs/20140401/
…
Data Processing
/data/marketing/clickstream/bouncerate/
/data/marketing/clickstream/attribution/
…
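
The raw-data directories above follow a one-directory-per-day layout. A minimal sketch of generating such paths is shown below; the function names are made up for this example, and the default root simply mirrors the `/etl/weblogs` prefix on the slide.

```python
from datetime import date, timedelta


def weblog_partition_path(day, root="/etl/weblogs"):
    """Return the directory holding one day's raw web logs, named YYYYMMDD."""
    return f"{root}/{day:%Y%m%d}/"


def partition_paths(start, end, root="/etl/weblogs"):
    """List the daily directories from start to end, inclusive."""
    days = (end - start).days
    return [weblog_partition_path(start + timedelta(days=i), root) for i in range(days + 1)]


if __name__ == "__main__":
    for path in partition_paths(date(2014, 3, 31), date(2014, 4, 1)):
        print(path)
```

Keeping each day in its own directory lets a processing job read only the days it needs instead of scanning the whole dataset.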
  
Final Architecture – Data Access
[Diagram labels: Hive/Impala; BI/Analytics Tools; DWH; Sqoop; Local Disk; R, etc.; DB import tool; JDBC/ODBC]
  
Contact info
•  Mark Grover
•  @mark_grover
•  www.linkedin.com/in/grovermark
•  Slides at slideshare.net/markgrover
  
Application architectures with Hadoop and Sessionization in MR

  • 1. 1 Headline  Goes  Here   Speaker  Name  or  Subhead  Goes  Here   DO  NOT  USE  PUBLICLY   PRIOR  TO  10/23/12   ApplicaAon  Architectures  with   Apache  Hadoop   Mark  Grover  |  @mark_grover   East  Bay  JUG   hadooparchitecturebook.com   June  25th,  2014   ©2014 Cloudera, Inc. All Rights Reserved.
  • 2. About Me
      • Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating).
      • Contributor to Hadoop, Hive, Spark, Sqoop, Flume.
      • Software developer at Cloudera
      • @mark_grover
  • 3. Co-authoring O'Reilly book
      • @hadooparchbook
      • hadooparchitecturebook.com
  • 4. What is Apache Hadoop?
      Apache Hadoop is an open source platform for data storage and processing that is:
      • Scalable
      • Fault tolerant
      • Distributed
      Core Hadoop system components:
      • Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
      • MapReduce: distributed computing framework
      Has the flexibility to store and mine any type of data:
      • Ask questions across structured and unstructured data that were previously impossible to ask or solve
      • Not bound by a single schema
      Excels at processing complex data:
      • Scale-out architecture divides workloads across multiple nodes
      • Flexible file system eliminates ETL bottlenecks
      Scales economically:
      • Can be deployed on commodity hardware
      • Open source platform guards against vendor lock
  • 5. Click Stream Analysis: Case Study
  • 6. Analytics
  • 7. Web Logs
      244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
      244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
  • 8. Clickstream Analytics
      244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
  • 9. Click Stream Analysis (Before Hadoop)
      [Diagram: Web logs, ODS, Extract/Transform/Load, Data Warehouse, Transform, Query, Business Intelligence]
  • 10. The Problems (Before Hadoop)
      [Diagram: Web logs, ODS, Extract/Transform/Load, Data Warehouse, Transform, Query, Business Intelligence, with numbered problem markers]
      1. Slow data transformations = missed ETL SLAs.
      2. Slow queries = frustrated business users.
  • 11. Click Stream Analysis (After Hadoop)
      [Diagram: Web logs, ODS, Extract/Transform/Load, Data Warehouse, Transform, Query, Business Intelligence; two elements are crossed out (X)]
  • 12. Click Stream Analysis (After Hadoop)
      [Diagram: Web logs, ODS, Transform, Query, Business Intelligence]
  • 13. Click Stream Analysis (After Hadoop)
      [Diagram: Web logs, ODS, Transform, Query, Business Intelligence]
      Open questions marked on the diagram: Flume or Sqoop? Flume or Sqoop? Hive/Impala/MR? Hive/Impala/Spark?
  • 14. Challenges of Hadoop Implementation
  • 16. Other challenges - Architectural Considerations
      • Storage managers?
          • HDFS? HBase?
      • Data storage and modeling:
          • File formats? Compression? Schema design?
      • Data movement
          • How do we actually get the data into Hadoop? How do we get it out?
      • Metadata
          • How do we manage data about the data?
      • Data access and processing
          • How will the data be accessed once in Hadoop? How can we transform it? How do we query it?
      • Orchestration
          • How do we manage the workflow for all of this?
  • 17. High level Design
  • 18. Clickstream Analytics
      244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
  • 19. [Diagram: high-level architecture around a Hadoop cluster]
      • 1. Ingestion: website users, web servers; operational data store and CRM system via Sqoop
      • 2. Processing
      • 3. Accessing: BI/visualization tool (e.g. MicroStrategy) for BI analysts; Spark for machine learning and graph processing; R/Python for statistical analysis; custom apps
      • 4. Orchestration via Oozie
  • 20. 2. Processing (since that's of most interest to this audience)
  • 21. Processing
      • De-duplication
      • Filtering
      • Sessionization
  • 22. Deduplication – remove duplicate records
      244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
      244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
  • 23. Filtering – filter out invalid records
      244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
      244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…
  • 24. Sessionization
      [Diagram: website visits split into Visitor 1 Session 1, Visitor 1 Session 2, and Visitor 2 Session 1; a gap of > 30 minutes starts a new session]
  • 25. Why sessionize?
      Helps answer questions like:
      • What is my website's bounce rate?
          • i.e. what percentage of visitors don't go past the landing page?
      • Which marketing channels (e.g. organic search, display ad, etc.) are leading to the most sessions?
          • Which of those lead to the most conversions (e.g. people buying things, signing up, etc.)?
      • Do attribution analysis: which channels are responsible for the most conversions?
  • 26. Sessionization
      244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" 165
      244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" 166
  • 27. How to Sessionize?
      1. Given a list of clicks, determine which clicks came from the same user
      2. Given a particular user's clicks, determine if a given click is a part of a new session or a continuation of the previous session
  • 28. #1 – Which clicks are from same user?
      • We can use:
          • IP address (244.157.45.12)
          • Cookies (A9A3BECE0563982D)
          • IP address (244.157.45.12) and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")
  • 30. #1 – Which clicks are from same user?
      244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
      244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
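To group clicks by user, each log line first has to be split into fields. The following is a minimal sketch, assuming lines shaped like the ones on the slides; the class name `ClickParser`, the regular expression, and the `ip|userAgent` key format are illustrative assumptions, not taken from the talk.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the fields needed to identify a user from one log line. */
final class ClickParser {
    // Assumed layout: ip - - [timestamp] "request" status bytes "referrer" "user agent"
    private static final Pattern LOG_LINE = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    /** Returns "ip|userAgent" as a user key, or empty if the line does not match. */
    static Optional<String> userKey(String line) {
        Matcher m = LOG_LINE.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1) + "|" + m.group(7));
    }

    public static void main(String[] args) {
        String line = "244.157.45.12 - - [17/Oct/2014:21:08:30 ] \"GET /seatposts HTTP/1.0\" 200 4463 "
            + "\"http://bestcyclingreviews.com/top_online_shops\" \"Mozilla/5.0 (Macintosh)\"";
        System.out.println(userKey(line).orElse("unparseable"));
    }
}
```

Lines that do not match are reported as empty rather than throwing, which mirrors the "ignore busted records" behaviour of the mapper shown later in the deck.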
  • 31. #2 – Which clicks are part of the same session?
      244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
      244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
      > 30 mins apart = different sessions
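The 30-minute rule can be sketched without any Hadoop machinery. This is an illustrative example only: the class `SessionSplitter` and its method are hypothetical names, and it assumes one user's click times are already sorted in ascending order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: assigns session numbers to one user's click timestamps. */
final class SessionSplitter {
    // 30-minute inactivity timeout, as on the slide.
    static final long SESSION_TIMEOUT_MS = 30L * 60 * 1000;

    /**
     * Takes one user's click times in ascending order (epoch milliseconds) and
     * returns a session number per click, starting at 1.
     */
    static List<Integer> assignSessions(List<Long> sortedClickTimes) {
        List<Integer> sessions = new ArrayList<>();
        int session = 0;
        Long previous = null;
        for (long t : sortedClickTimes) {
            // The first click, or a gap longer than the timeout, starts a new session.
            if (previous == null || t - previous > SESSION_TIMEOUT_MS) {
                session++;
            }
            sessions.add(session);
            previous = t;
        }
        return sessions;
    }

    public static void main(String[] args) {
        // 21:08:30 and 21:59:59 are 51 min 29 s apart, so they land in different sessions.
        long first = 0L;
        long second = (51 * 60 + 29) * 1000L;
        System.out.println(assignSessions(Arrays.asList(first, second)));
    }
}
```

The sketch numbers sessions per user from 1; the reducer in the deck instead uses a single running counter, which is an equally valid choice when only distinctness matters.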
  • 32. Intro to MapReduce
  • 33. MapReduce
      • Map – Apply a function to each input record
      • Shuffle & Sort – Partition the map output and sort each partition
      • Reduce – Apply aggregation function to all values in each partition
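The three phases can be imitated in a single process to make the data flow concrete. This is only an illustration under simplifying assumptions (one JVM, one partition, records whose first space-separated field is the IP); `TinyMapReduce` and its methods are hypothetical names and do not use the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical single-process illustration of the three MapReduce phases. */
final class TinyMapReduce {
    /** Map: turn one input record (a log line) into a key, here its first field (the IP). */
    static String mapToKey(String logLine) {
        return logLine.split(" ", 2)[0];
    }

    /** Shuffle & sort: group records by key; TreeMap keeps the keys sorted. */
    static Map<String, List<String>> shuffle(List<String> logLines) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String line : logLines) {
            groups.computeIfAbsent(mapToKey(line), k -> new ArrayList<>()).add(line);
        }
        return groups;
    }

    /** Reduce: aggregate all values for each key, here by counting them. */
    static Map<String, Integer> reduce(Map<String, List<String>> groups) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "10.0.0.2 GET /a", "10.0.0.1 GET /b", "10.0.0.2 GET /c");
        System.out.println(reduce(shuffle(lines)));
    }
}
```

In a real cluster the shuffle step also distributes the groups across many reducers; the sketch keeps everything in one map to show only the grouping and ordering.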
  • 34. Sessionization in MapReduce
  • 36. Sessionization in MapReduce
      [Diagram: each Map task takes a log line and emits it keyed by IP; each Reduce task receives the log lines for one IP (IP1, IP2) and emits each log line with a session ID]
  • 37. Mapper for Sessionization
      // logRecordPattern and TIMESTAMP_FORMATTER are not shown on the slide; they are
      // assumed to be static members of the enclosing job class.
      public static class SessionizeMapper
              extends Mapper<Object, Text, IpTimestampKey, Text> {
          private Matcher logRecordMatcher;

          @Override
          public void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
              logRecordMatcher = logRecordPattern.matcher(value.toString());

              // We only emit something out if the record matches with our regex. Otherwise, we
              // assume the record is busted and simply ignore it.
              if (logRecordMatcher.matches()) {
                  String ip = logRecordMatcher.group(1);
                  DateTime timestamp = DateTime.parse(logRecordMatcher.group(2),
                          TIMESTAMP_FORMATTER);
                  Long unixTimestamp = timestamp.getMillis();
                  IpTimestampKey outputKey = new IpTimestampKey(ip, unixTimestamp);
                  context.write(outputKey, value);
              }
          }
      }
  • 38. Reducer for Sessionization
      public static class SessionizeReducer
              extends Reducer<IpTimestampKey, Text, IpTimestampKey, Text> {
          private Text result = new Text();
          private static int sessionId = 0;
          private Long lastTimeStamp = null;

          @Override
          public void reduce(IpTimestampKey key, Iterable<Text> values, Context context)
                  throws IOException, InterruptedException {
              // Each reduce() call handles one user (IP), so forget the previous user's
              // last click; otherwise a new user's first click could join an old session.
              lastTimeStamp = null;
              for (Text value : values) {
                  String logRecord = value.toString();
                  // If this is the first record for this user or it's been more than the timeout
                  // since the last click from this user, let's increment the session ID.
                  // The key is updated to match the current value as we iterate.
                  if (lastTimeStamp == null
                          || (key.getUnixTimestamp() - lastTimeStamp > SESSION_TIMEOUT_IN_MS)) {
                      sessionId++;
                  }
                  lastTimeStamp = key.getUnixTimestamp();
                  result.set(logRecord + " " + sessionId);
                  // Since we only care about printing out the entire record in the result, with
                  // session ID appended at the end, we just emit out "null" for the key.
                  context.write(null, result);
              }
          }
      }
  • 39. Secondary sorting – by timestamp
      • Records arriving at the reducer need to be grouped by IP address and sorted by timestamp, a concept called secondary sorting
      • Instead of using just the IP address as the map output key and reduce input key
          • We use a composite key (IP, timestamp) as the map output key and reduce input key
  • 40. Secondary sorting – vocabulary
      • Composite key – IP address, timestamp
      • Natural key – IP address
      • Secondary sort key – timestamp
  • 41. Secondary sorting
      • Custom Grouping Comparator – on Natural Key (IP)
      • Custom Sort Comparator – on Composite Key (IP, timestamp)
      • Custom Partitioner – on Natural Key (IP)
          job.setGroupingComparatorClass(NaturalKeyComparator.class);
          job.setSortComparatorClass(CompositeKeyComparator.class);
          job.setPartitionerClass(NaturalKeyPartitioner.class);
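What the three pieces decide can be modelled with plain `java.util.Comparator` objects. This is a sketch only: `SecondarySortSketch`, its nested `Key` class, and the constants are hypothetical stand-ins, not the `NaturalKeyComparator`, `CompositeKeyComparator`, or `NaturalKeyPartitioner` classes from the talk, whose source is not shown on the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical plain-Java model of the three secondary-sort pieces (no Hadoop types). */
final class SecondarySortSketch {
    /** Composite key: natural key (IP) plus secondary sort key (timestamp). */
    static final class Key {
        final String ip;
        final long timestamp;
        Key(String ip, long timestamp) { this.ip = ip; this.timestamp = timestamp; }
    }

    // Sort comparator: orders by the full composite key (IP, then timestamp).
    static final Comparator<Key> COMPOSITE_ORDER =
        Comparator.comparing((Key k) -> k.ip).thenComparingLong(k -> k.timestamp);

    // Grouping comparator: two keys belong to the same reduce call if their IPs match.
    static final Comparator<Key> NATURAL_ORDER = Comparator.comparing((Key k) -> k.ip);

    // Partitioner: chooses a reducer from the IP only, so one user's clicks stay together.
    static int partition(Key key, int numReducers) {
        return (key.ip.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        List<Key> keys = new ArrayList<>(Arrays.asList(
            new Key("10.0.0.2", 300L), new Key("10.0.0.1", 200L), new Key("10.0.0.2", 100L)));
        keys.sort(COMPOSITE_ORDER);
        for (Key k : keys) {
            System.out.println(k.ip + " " + k.timestamp + " -> reducer " + partition(k, 4));
        }
    }
}
```

The key point is that partitioning and grouping ignore the timestamp while sorting uses it, so each reducer call sees one IP's clicks in time order.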
  • 42. Data Storage and Modeling
  • 43. Data Storage – Storage Manager considerations
      • Popular storage managers for Hadoop
          • Hadoop Distributed File System (HDFS)
          • HBase
  • 44. Data Storage – HDFS vs HBase
      HDFS
      • Stores data directly as files
      • Fast scans
      • Poor random reads/writes
      HBase
      • Stores data as HFiles on HDFS
      • Slow scans
      • Fast random reads/writes
  • 45. Data Storage – Storage Manager considerations
      • We choose HDFS
          • Analytical needs in this case served better by fast scans.
  • 46. Data Storage Format
  • 47. Data Storage – Format Considerations
      • Store as plain text?
          • Sure, well supported by Hadoop.
          • Text can easily be processed by MapReduce, loaded into Hive for analysis, and so on.
      • But…
          • Will begin to consume lots of space in HDFS.
          • May not be optimal for processing by tools in the Hadoop ecosystem.
  • 48. Data Storage – Format Considerations
      • But, we can compress the text files…
          • Gzip – supported by Hadoop, but not splittable.
          • Bzip2 – hey, splittable! Great compression! But decompression is slooowww.
          • LZO – splittable (with some work), good compress/de-compress performance. Good choice for storing text files on Hadoop.
          • Snappy – provides a good tradeoff between size and speed.
  • 49. Data Storage – More About Snappy
      • Designed at Google to provide high compression speeds with reasonable compression.
      • Not the highest compression, but provides very good performance for processing on Hadoop.
      • Snappy is not splittable though, which brings us to…
  • 50. SequenceFile
      • Stores records as binary key/value pairs.
      • SequenceFile "blocks" can be compressed.
      • This enables splittability with non-splittable compression.
  • 51. Avro
      • Kinda SequenceFile on Steroids.
      • Self-documenting – stores schema in header.
      • Provides very efficient storage.
      • Supports splittable compression.
  • 52. Our Format Choices…
      • Avro with Snappy
          • Snappy provides optimized compression.
          • Avro provides compact storage, self-documenting files, and supports schema evolution.
          • Avro also provides better failure handling than other choices.
      • SequenceFiles would also be a good choice, and are directly supported by ingestion tools in the ecosystem.
          • But SequenceFiles only support Java.
  • 53. HDFS Schema Design
  • 54. Recommended HDFS Schema Design
      • How to lay out data on HDFS?
  • 55. Recommended HDFS Schema Design
      /user/<username> – User specific data, jars, conf files
      /etl – Data in various stages of ETL workflow
      /tmp – temp data from tools or shared between users
      /data – shared data for the entire organization
      /app – Everything but data: UDF jars, HQL files, Oozie workflows
  • 56. Advanced HDFS Schema Design
  • 57. What is Partitioning?
      Un-partitioned HDFS directory structure:
          dataset
            file1.txt
            file2.txt
            ...
            filen.txt
      Partitioned HDFS directory structure:
          dataset
            col=val1/file.txt
            col=val2/file.txt
            ...
            col=valn/file.txt
  • 58. What is Partitioning?
      Un-partitioned HDFS directory structure:
          clicks
            clicks-2014-01-01.txt
            clicks-2014-01-02.txt
            ...
            clicks-2014-03-31.txt
      Partitioned HDFS directory structure:
          clicks
            dt=2014-01-01/clicks.txt
            dt=2014-01-02/clicks.txt
            ...
            dt=2014-03-31/clicks.txt
  • 59. Partitioning
      • Split the dataset into smaller consumable chunks
      • Rudimentary form of "indexing"
      • <data set name>/<partition_column_name=partition_column_value>/{files}
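The `<data set name>/<column>=<value>/` layout is easy to generate programmatically. A small sketch, assuming the `dt=YYYY-MM-DD` naming shown on the earlier slide; the class `PartitionPaths` and its methods are hypothetical helpers that only build path strings and do not touch HDFS.

```java
import java.time.LocalDate;

/** Hypothetical helper that builds a date-partitioned HDFS-style directory path. */
final class PartitionPaths {
    /** Returns "<dataset>/<column>=<value>". */
    static String partitionDir(String dataset, String column, String value) {
        return dataset + "/" + column + "=" + value;
    }

    /** Date partition using the "dt=YYYY-MM-DD" naming shown on the slides. */
    static String dailyPartition(String dataset, LocalDate day) {
        // LocalDate.toString() yields the ISO-8601 form, e.g. 2014-01-01.
        return partitionDir(dataset, "dt", day.toString());
    }

    public static void main(String[] args) {
        System.out.println(dailyPartition("clicks", LocalDate.of(2014, 1, 1)));
    }
}
```

A query restricted to one day then only needs to read the files under that single directory, which is the "rudimentary indexing" the slide refers to.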
  • 60. Partitioning considerations
      • What column to partition by?
      • HDFS is append only.
      • Don't have too many partitions (<10,000)
      • Don't have too many small files in the partitions (files should generally be larger than the block size)
      • We decided to partition by timestamp
  • 61. What is bucketing?
      Un-bucketed HDFS directory structure:
          clicks
            dt=2014-01-01/clicks.txt
            dt=2014-01-02/clicks.txt
      Bucketed HDFS directory structure:
          clicks
            dt=2014-01-01/file0.txt
            dt=2014-01-01/file1.txt
            dt=2014-01-01/file2.txt
            dt=2014-01-01/file3.txt
            dt=2014-01-02/file0.txt
            dt=2014-01-02/file1.txt
            dt=2014-01-02/file2.txt
            dt=2014-01-02/file3.txt
  • 62. Bucketing
      • Hash-bucketed files within each partition based on a particular column
      • Useful when sampling
      • Useful in some joins; prerequisites:
          • Datasets bucketed on the same key as the join key
          • Number of buckets is the same, or one is a multiple of the other
  • 63. Bucketing considerations?
      • Which column to bucket on?
      • How many buckets?
      • We decided to bucket based on cookie
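Hash bucketing boils down to mapping a column value to one of N files. The sketch below assumes Java's `String.hashCode()` as the hash and the `fileN.txt` naming from the earlier slide; `Bucketing` is a hypothetical class, and tools such as Hive use their own hash functions, so the bucket numbers here are illustrative only.

```java
/** Hypothetical sketch of hash bucketing: maps a column value to one of N bucket files. */
final class Bucketing {
    /** Returns a bucket index in [0, numBuckets) derived from the value's hash code. */
    static int bucketFor(String value, int numBuckets) {
        // Masking the sign bit keeps the result non-negative for any hash code.
        return (value.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    /** File name inside a partition, following the "fileN.txt" naming on the slide. */
    static String bucketFile(String partitionDir, String value, int numBuckets) {
        return partitionDir + "/file" + bucketFor(value, numBuckets) + ".txt";
    }

    public static void main(String[] args) {
        // The cookie value is the example cookie from the earlier slide.
        System.out.println(bucketFile("clicks/dt=2014-01-01", "A9A3BECE0563982D", 4));
    }
}
```

Because the same cookie always hashes to the same bucket, two datasets bucketed on the cookie with compatible bucket counts can be joined bucket by bucket.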
  • 64. De-normalizing considerations
      • In general, big data joins are expensive
      • When to de-normalize?
      • Decided to join the smaller dimension tables
      • Big fact tables are still joined
  • 65. Data Ingestion
  • 66. File Transfers
      • "hadoop fs -put <file>"
      • Reliable, but not resilient to failure.
      • Other options are mountable HDFS, for example NFSv3.
  • 67. Streaming Ingestion
      • Flume
          • Reliable, distributed, and available system for efficient collection, aggregation and movement of streaming data, e.g. logs.
      • Kafka
          • Reliable and distributed publish-subscribe messaging system.
  • 68. Flume vs. Kafka
      Flume:
      • Purpose built for Hadoop data ingest.
      • Pre-built sinks for HDFS, HBase, etc.
      • Supports transformation of data in-flight.
      Kafka:
      • General pub-sub messaging framework.
      • Hadoop not supported, requires 3rd-party component (Camus).
      • Just a message transport (a very fast one).
  • 69. Flume vs. Kafka
      • Bottom line:
          • Flume very well integrated with Hadoop ecosystem, well suited to ingestion of sources such as log files.
          • Kafka is a highly reliable and scalable enterprise messaging system, and great for scaling out to multiple consumers.
  • 70. A Quick Introduction to Flume
      [Diagram: External Source → Flume Agent (Source → Channel → Sink) → Destination]
      • External sources: web server, Twitter, JMS, system logs, …
      • Flume agent: JVM process hosting components
      • Source: consumes events and forwards to channels
      • Channel: stores events until consumed by sinks (file, memory, JDBC)
      • Sink: removes event from channel and puts into external destination
  • 71. A Quick Introduction to Flume
      • Reliable – events are stored in channel until delivered to next stage.
      • Recoverable – events can be persisted to disk and recovered in the event of failure.
      [Diagram: Flume Agent (Source, Channel, Sink), Destination]
  • 72. A Quick Introduction to Flume
      • Declarative
          • No coding required.
          • Configuration specifies how components are wired together.
  • 73. A Brief Discussion of Flume Patterns – Fan-in
      • Flume agent runs on each of our servers.
      • These agents send data to multiple agents to provide reliability.
      • Flume provides support for load balancing.
  • 74. A Brief Discussion of Flume Patterns – Splitting
      • Common need is to split data on ingest.
      • For example:
          • Sending data to multiple clusters for DR.
          • To multiple destinations.
      • Flume also supports partitioning, which is key to our implementation.
  • 75. Sqoop Overview
      • Apache project designed to ease import and export of data between Hadoop and external data stores such as relational databases.
      • Great for doing bulk imports and exports of data between HDFS, Hive and HBase and an external data store. Not suited for ingesting event based data.
  • 76. Ingestion Decisions
      • Historical Data
          • Smaller files: file transfer
          • Larger files: Flume with spooling directory source.
      • Incoming Data
          • Flume with the spooling directory source.
  • 77. Data Processing and Access
  • 78. Data flow
      [Diagram: Raw data; Partitioned clickstream data; Other data (Financial, CRM, etc.); Aggregated dataset #1; Aggregated dataset #2]
  • 79. Data processing tools
      • Hive
      • Impala
      • Pig, etc.
  • 80. Hive
      • Open source data warehouse system for Hadoop
      • Converts SQL-like queries to MapReduce jobs
          • Work is being done to move this away from MR
      • Stores metadata in Hive metastore
      • Can create tables over HDFS or HBase data
      • Access available via JDBC/ODBC
  • 81. Impala
      • Real-time open source SQL query engine for Hadoop
      • Doesn't build on MapReduce
      • Written in C++, uses LLVM for run-time code generation
      • Can create tables over HDFS or HBase data
      • Accesses Hive metastore for metadata
      • Access available via JDBC/ODBC
  • 82. Pig
      • Higher level abstraction over MapReduce (like Hive)
      • Write transformations in scripting language – Pig Latin
      • Can access Hive metastore via HCatalog for metadata
  • 83. Data Processing considerations
      • We chose Hive for ETL and Impala for interactive BI.
  • 84. Metadata Management
  • 85. What is Metadata?
      • Metadata is data about the data
      • Format in which data is stored
      • Compression codec
      • Location of the data
      • Is the data partitioned/bucketed/sorted?
  • 86. Metadata in Hive
      [Diagram: Hive Metastore]
  • 87. Metadata
      • Hive metastore has become the de-facto metadata repository
      • HCatalog makes Hive metastore accessible to other applications (Pig, MapReduce, custom apps, etc.)
  • 88. Hive + HCatalog
  • 89. Orchestration
  • 90. Orchestration
      • Once the data is in Hadoop, we need a way to manage workflows in our architecture.
          • Scheduling and tracking MapReduce jobs, Hive jobs, etc.
      • Several options here:
          • Cron
          • Oozie, Azkaban
          • 3rd-party tools, Talend, Pentaho, Informatica, enterprise schedulers.
  • 91. Oozie
      • Supports defining and executing a sequence of jobs.
      • Can trigger jobs based on external dependencies or schedules.
  • 92. Final Architecture
  • 93. Final Architecture – High Level Overview
      [Diagram: Data Sources → Ingestion → Data Storage/Processing → Data Reporting/Analysis]
  • 94. Final Architecture – High Level Overview
      [Diagram: Data Sources → Ingestion → Data Storage/Processing → Data Reporting/Analysis]
  • 95. Final Architecture – Ingestion
      [Diagram: eight Web App / Avro Agent pairs, four Flume Agents, HDFS. Annotations: fan-in pattern; multiple agents for failover and rolling restarts]
  • 96. Final Architecture – High Level Overview
      [Diagram: Data Sources → Ingestion → Data Storage/Processing → Data Reporting/Analysis]
  • 97. Final Architecture – Storage and Processing
      [Diagram: /etl/weblogs/20140331/, /etl/weblogs/20140401/, … → Data Processing → /data/marketing/clickstream/bouncerate/, /data/marketing/clickstream/attribution/, …]
  • 98. Final Architecture – High Level Overview
      [Diagram: Data Sources → Ingestion → Data Storage/Processing → Data Reporting/Analysis]
  • 99. Final Architecture – Data Access
      [Diagram: Hive/Impala, JDBC/ODBC, BI/Analytics Tools, Sqoop, DWH, DB import tool, Local Disk, R, etc.]
  • 100. Contact info
      • Mark Grover
      • @mark_grover
      • www.linkedin.com/in/grovermark
      • Slides at slideshare.net/markgrover