SlideShare a Scribd company logo
1 of 160
Download to read offline
1	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Architectural	
  Considera8ons	
  
for	
  Hadoop	
  Applica8ons	
  
Mark	
  Grover	
  |	
  @mark_grover	
  
Jonathan	
  Seidman	
  |	
  @jseidman	
  	
  
Gwen	
  Shapira	
  |	
  @gwenshap	
  
IEEE	
  BigDataService	
  2015	
  |	
  March	
  30th,	
  2015	
  
8ny.cloudera.com/app-­‐arch-­‐slides	
  
2	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
About	
  the	
  book	
  
•  @hadooparchbook	
  
•  hadooparchitecturebook.com	
  
•  github.com/hadooparchitecturebook	
  
•  slideshare.com/hadooparchbook	
  
3	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
About	
  the	
  presenters	
  
• Principal	
  Solu8ons	
  Architect	
  
at	
  Cloudera	
  
• Previously,	
  lead	
  architect	
  at	
  
FINRA	
  
• Contributor	
  to	
  Apache	
  
Hadoop,	
  HBase,	
  Flume,	
  
Avro,	
  Pig	
  and	
  Spark	
  
• Senior	
  Solu8ons	
  Architect/
Partner	
  Enablement	
  at	
  
Cloudera	
  
• Previously,	
  Technical	
  Lead	
  
on	
  the	
  big	
  data	
  team	
  at	
  
Orbitz	
  Worldwide	
  
• Co-­‐founder	
  of	
  the	
  Chicago	
  
Hadoop	
  User	
  Group	
  and	
  
Chicago	
  Big	
  Data	
  
Ted	
  Malaska	
   Jonathan	
  Seidman	
  
4	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
About	
  the	
  presenters	
  
• Solu8ons	
  Architect	
  turned	
  
So]ware	
  Engineer	
  at	
  
Cloudera	
  
• Commi^er	
  on	
  Apache	
  
Sqoop	
  
• Contributor	
  to	
  Apache	
  
Flume	
  and	
  Apache	
  Kaaa	
  
• So]ware	
  Engineer	
  at	
  
Cloudera	
  
• Commi^er	
  on	
  Apache	
  
Bigtop,	
  PMC	
  member	
  on	
  
Apache	
  Sentry	
  (incuba8ng)	
  
• Contributor	
  to	
  Apache	
  
Hadoop,	
  Spark,	
  Hive,	
  Sqoop,	
  
Pig	
  and	
  Flume	
  
Gwen	
  Shapira	
   Mark	
  Grover	
  
5	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Logis8cs	
  
•  Break	
  at	
  10:30-­‐11:00	
  PM	
  
•  Ques8ons	
  at	
  the	
  end	
  of	
  each	
  sec8on	
  
6	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Case	
  Study	
  
Clickstream	
  Analysis	
  
7	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Analy8cs	
  
8	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Analy8cs	
  
9	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Analy8cs	
  
10	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Analy8cs	
  
11	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Analy8cs	
  
12	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Analy8cs	
  
13	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Analy8cs	
  
14	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Web	
  Logs	
  –	
  Combined	
  Log	
  Format	
  
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0"
200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/
5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/36.0.1944.0 Safari/537.36”
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?
productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com"
"Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/
GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile
Safari/533.1”
15	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Clickstream	
  Analy8cs	
  
244.157.45.12 - - [17/Oct/
2014:21:08:30 ] "GET /seatposts
HTTP/1.0" 200 4463 "http://
bestcyclingreviews.com/
top_online_shops" "Mozilla/5.0
(Macintosh; Intel Mac OS X 10_9_2)
AppleWebKit/537.36 (KHTML, like
Gecko) Chrome/36.0.1944.0 Safari/
537.36”
16	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Similar	
  use-­‐cases	
  
•  Sensors	
  –	
  heart,	
  agriculture,	
  etc.	
  
•  Casinos	
  –	
  session	
  of	
  a	
  person	
  at	
  a	
  table	
  
17	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Pre-­‐Hadoop	
  Architecture	
  
Clickstream	
  Analysis	
  
18	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Click	
  Stream	
  Analysis	
  (Before	
  Hadoop)	
  
Web	
  logs	
  
(full	
  fidelity)	
  
(2	
  weeks)	
  
Data	
  Warehouse	
  
Transform/
Aggregate	
   Business	
  
Intelligence	
  
Tape	
  Archive	
  
19	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Problems	
  with	
  Pre-­‐Hadoop	
  Architecture	
  
•  Full	
  fidelity	
  data	
  is	
  stored	
  for	
  small	
  amount	
  of	
  8me	
  (~weeks).	
  
•  Older	
  data	
  is	
  sent	
  to	
  tape,	
  or	
  even	
  worse,	
  deleted!	
  
•  Inflexible	
  workflow	
  -­‐	
  think	
  of	
  all	
  aggrega8ons	
  beforehand	
  
20	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Effects	
  of	
  Pre-­‐Hadoop	
  Architecture	
  
•  Regenera8ng	
  aggregates	
  is	
  expensive	
  or	
  worse,	
  impossible	
  
•  Can’t	
  correct	
  bugs	
  in	
  the	
  workflow/aggrega8on	
  logic	
  
•  Can’t	
  do	
  experiments	
  on	
  exis8ng	
  data	
  
21	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Why	
  is	
  Hadoop	
  A	
  Great	
  Fit?	
  
Clickstream	
  Analysis	
  
22	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Why	
  is	
  Hadoop	
  a	
  great	
  fit?	
  	
  
•  Volume	
  of	
  clickstream	
  data	
  is	
  huge	
  
•  Velocity	
  at	
  which	
  it	
  comes	
  in	
  is	
  high	
  
•  Variety	
  of	
  data	
  is	
  diverse	
  -­‐	
  semi-­‐structured	
  data	
  
•  Hadoop	
  enables	
  
• ac8ve	
  archival	
  of	
  data	
  
• Aggrega8on	
  jobs	
  
• Querying	
  the	
  above	
  aggregates	
  or	
  raw	
  fidelity	
  data	
  
23	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Click	
  Stream	
  Analysis	
  (with	
  Hadoop)	
  
Web	
  logs	
   Hadoop	
   Business	
  
Intelligence	
  
Ac8ve	
  archive	
  (no	
  tape)	
  
Aggrega8on	
  engine	
  
Querying	
  engine	
  
24	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Challenges	
  of	
  Hadoop	
  Implementa8on	
  
25	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Challenges	
  of	
  Hadoop	
  Implementa8on	
  
26	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Other	
  challenges	
  -­‐	
  Architectural	
  Considera8ons	
  	
  
•  Storage	
  managers?	
  
•  HDFS?	
  HBase?	
  
•  Data	
  storage	
  and	
  modeling:	
  
•  File	
  formats?	
  Compression?	
  Schema	
  design?	
  
•  Data	
  movement	
  
•  How	
  do	
  we	
  actually	
  get	
  the	
  data	
  into	
  Hadoop?	
  How	
  do	
  we	
  get	
  it	
  out?	
  
•  Metadata	
  
•  How	
  do	
  we	
  manage	
  data	
  about	
  the	
  data?	
  
•  Data	
  access	
  and	
  processing	
  
•  How	
  will	
  the	
  data	
  be	
  accessed	
  once	
  in	
  Hadoop?	
  How	
  can	
  we	
  transform	
  it?	
  How	
  do	
  we	
  query	
  
it?	
  
•  Orchestra8on	
  
•  How	
  do	
  we	
  manage	
  the	
  workflow	
  for	
  all	
  of	
  this?	
  
27	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Case	
  Study	
  Requirements	
  
Overview	
  of	
  Requirements	
  
28	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Overview	
  of	
  Requirements	
  
Data	
  
Sources	
  
Inges8on	
  
Raw	
  Data	
  
Storage	
  
(Formats,	
  
Schema)	
  
Processed	
  
Data	
  Storage	
  
(Formats,	
  
Schema)	
  
Processing	
  
Data	
  
Consump8on	
  
Orchestra8on	
  
(Scheduling,	
  
Managing,	
  
Monitoring)	
  
29	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Case	
  Study	
  Requirements	
  
Data	
  Inges8on	
  
30	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Data	
  Inges8on	
  Requirements	
  
Web	
  Servers	
  
Web	
  Servers	
  
Web	
  Servers	
  
Web	
  Servers	
  
Web	
  Servers	
  
Logs	
  	
  
244.157.45.12 - - [17/
Oct/2014:21:08:30 ]
"GET /seatposts HTTP/
1.0" 200 4463 …	
  
CRM	
  Data	
  
ODS	
  
Web	
  Servers	
  
Web	
  Servers	
  
Web	
  Servers	
  
Web	
  Servers	
  
Web	
  Servers	
  
Logs	
  	
  
Hadoop	
  
31	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Data	
  Inges8on	
  Requirements	
  
•  So	
  we	
  need	
  to	
  be	
  able	
  to	
  support:	
  
• Reliable	
  inges8on	
  of	
  large	
  volumes	
  of	
  semi-­‐structured	
  event	
  data	
  arriving	
  with	
  
high	
  velocity	
  (e.g.	
  logs).	
  
• Timeliness	
  of	
  data	
  availability	
  –	
  data	
  needs	
  to	
  be	
  available	
  for	
  processing	
  to	
  
meet	
  business	
  service	
  level	
  agreements.	
  
• Periodic	
  inges8on	
  of	
  data	
  from	
  rela8onal	
  data	
  stores.	
  
32	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Case	
  Study	
  Requirements	
  
Data	
  Storage	
  
33	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Data	
  Storage	
  Requirements	
  
Store	
  all	
  the	
  data	
  
Make	
  the	
  data	
  
accessible	
  for	
  
processing	
  
Compress	
  the	
  
data	
  
34	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Case	
  Study	
  Requirements	
  
Data	
  Processing	
  
35	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Processing	
  requirements	
  
Be	
  able	
  to	
  answer	
  ques8ons	
  like:	
  
•  What	
  is	
  my	
  website’s	
  bounce	
  rate?	
  
• i.e.	
  how	
  many	
  %	
  of	
  visitors	
  don’t	
  go	
  past	
  the	
  landing	
  page?	
  
•  Which	
  marke8ng	
  channels	
  are	
  leading	
  to	
  most	
  sessions?	
  
•  Do	
  a^ribu8on	
  analysis	
  
• Which	
  channels	
  are	
  responsible	
  for	
  most	
  conversions?	
  
36	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Sessioniza8on	
  
Website	
  visit	
  
Visitor	
  1	
  
Session	
  1	
  
Visitor	
  1	
  
Session	
  2	
  
Visitor	
  2	
  
Session	
  1	
  
>	
  30	
  minutes	
  
37	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Case	
  Study	
  Requirements	
  
Orchestra8on	
  
38	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Orchestra8on	
  is	
  simple	
  
We	
  just	
  need	
  to	
  execute	
  ac8ons	
  
One	
  a]er	
  another	
  
39	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Actually,	
  	
  
we	
  also	
  need	
  to	
  handle	
  errors	
  
And	
  user	
  no8fica8ons	
  
….	
  
	
  
40	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
And…	
  
•  Re-­‐start	
  workflows	
  a]er	
  errors	
  
•  Reuse	
  of	
  ac8ons	
  in	
  mul8ple	
  workflows	
  
•  Complex	
  workflows	
  with	
  decision	
  points	
  
•  Trigger	
  ac8ons	
  based	
  on	
  events	
  
•  Tracking	
  metadata	
  	
  
•  Integra8on	
  with	
  enterprise	
  so]ware	
  
•  Data	
  lifecycle	
  
•  Data	
  quality	
  control	
  
•  Reports	
  
41	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
OK,	
  maybe	
  we	
  need	
  a	
  product	
  
To	
  help	
  us	
  do	
  all	
  that	
  
	
  
42	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Architectural	
  Considera8ons	
  
Data	
  Modeling	
  
43	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Data	
  Modeling	
  Considera8ons	
  
•  We	
  need	
  to	
  consider	
  the	
  following	
  in	
  our	
  architecture:	
  
• Storage	
  layer	
  –	
  HDFS?	
  HBase?	
  Etc.	
  
• File	
  system	
  schemas	
  –	
  how	
  will	
  we	
  lay	
  out	
  the	
  data?	
  
• File	
  formats	
  –	
  what	
  storage	
  formats	
  to	
  use	
  for	
  our	
  data,	
  both	
  raw	
  and	
  
processed	
  data?	
  
• Data	
  compression	
  formats?	
  
44	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Architectural	
  Considera8ons	
  
Data	
  Modeling	
  –	
  Storage	
  Layer	
  
45	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Data	
  Storage	
  Layer	
  Choices	
  
•  Two	
  likely	
  choices	
  for	
  raw	
  data:	
  
46	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Data	
  Storage	
  Layer	
  Choices	
  
• Stores	
  data	
  directly	
  as	
  files	
  
• Fast	
  scans	
  
• Poor	
  random	
  reads/writes	
  
• Stores	
  data	
  as	
  Hfiles	
  on	
  HDFS	
  
• Slow	
  scans	
  
• Fast	
  random	
  reads/writes	
  
47	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Data	
  Storage	
  –	
  Storage	
  Manager	
  Considera8ons	
  
•  Incoming	
  raw	
  data:	
  
• Processing	
  requirements	
  call	
  for	
  batch	
  transforma8ons	
  across	
  mul8ple	
  records	
  
–	
  for	
  example	
  sessioniza8on.	
  
•  Processed	
  data:	
  
• Access	
  to	
  processed	
  data	
  will	
  be	
  via	
  things	
  like	
  analy8cal	
  queries	
  –	
  again	
  
requiring	
  access	
  to	
  mul8ple	
  records.	
  
•  We	
  choose	
  HDFS	
  
• Processing	
  needs	
  in	
  this	
  case	
  served	
  be^er	
  by	
  fast	
  scans.	
  
48	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Architectural	
  Considera8ons	
  
Data	
  Modeling	
  –	
  Raw	
  Data	
  Storage	
  
49	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Storage	
  Formats	
  –	
  Raw	
  Data	
  and	
  Processed	
  Data	
  
Processed	
  
Data	
  
Raw	
  
	
  Data	
  
50	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Data	
  Storage	
  –	
  Format	
  Considera8ons	
  	
  
Logs	
  
(plain	
  
text)	
  
51	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Data	
  Storage	
  –	
  Format	
  Considera8ons	
  	
  
Logs	
  
(plain	
  
text)	
  
Logs	
  
(plain	
  
text)	
  
Logs	
  
(plain	
  
text)	
  
Logs	
  
(plain	
  
text)	
  
Logs	
  
(plain	
  
text)	
  
Logs	
  
(plain	
  
text)	
  
Logs	
  
Logs	
  
Logs	
  
Logs	
  
Logs	
  
52	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Data	
  Storage	
  –	
  Compression	
  
snappy
Well,	
  maybe.	
  
But	
  not	
  spli^able.	
  
X
Spli^able.	
  Gewng	
  	
  
be^er…	
  
Hmmm….	
  Spli^able,	
  but	
  no...	
  
53	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Raw	
  Data	
  Storage	
  –	
  More	
  About	
  Snappy	
  
• Designed	
  at	
  Google	
  to	
  provide	
  high	
  compression	
  speeds	
  with	
  reasonable	
  
compression.	
  
• Not	
  the	
  highest	
  compression,	
  but	
  provides	
  very	
  good	
  performance	
  for	
  
processing	
  on	
  Hadoop.	
  
• Snappy	
  is	
  not	
  spli^able	
  though,	
  which	
  brings	
  us	
  to…	
  
54	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Hadoop	
  File	
  Types	
  
•  Formats	
  designed	
  specifically	
  to	
  store	
  and	
  process	
  data	
  on	
  Hadoop:	
  
• File	
  based	
  –	
  SequenceFile	
  
• Serializa8on	
  formats	
  –	
  Thri],	
  Protocol	
  Buffers,	
  Avro	
  
• Columnar	
  formats	
  –	
  RCFile,	
  ORC,	
  Parquet	
  
55	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
SequenceFile	
  
• Stores	
  records	
  as	
  binary	
  
key/value	
  pairs.	
  
• SequenceFile	
  “blocks”	
  can	
  
be	
  compressed.	
  
• This	
  enables	
  spli^ability	
  
with	
  non-­‐spli^able	
  
compression.	
  	
  	
  
56	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Avro	
  
• Kinda	
  SequenceFile	
  on	
  
Steroids.	
  
• Self-­‐documen8ng	
  –	
  stores	
  
schema	
  in	
  header.	
  
• Provides	
  very	
  efficient	
  
storage.	
  
• Supports	
  spli^able	
  
compression.	
  
57	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Our	
  Format	
  Recommenda8ons	
  for	
  Raw	
  Data…	
  
•  Avro	
  with	
  Snappy	
  
• Snappy	
  provides	
  op8mized	
  compression.	
  
• Avro	
  provides	
  compact	
  storage,	
  self-­‐documen8ng	
  files,	
  and	
  supports	
  schema	
  
evolu8on.	
  
• Avro	
  also	
  provides	
  be^er	
  failure	
  handling	
  than	
  other	
  choices.	
  
•  SequenceFiles	
  would	
  also	
  be	
  a	
  good	
  choice,	
  and	
  are	
  directly	
  supported	
  by	
  
inges8on	
  tools	
  in	
  the	
  ecosystem.	
  
• But	
  only	
  supports	
  Java.	
  
58	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
But	
  Note…	
  
•  For	
  simplicity,	
  we’ll	
  use	
  plain	
  text	
  for	
  raw	
  data	
  in	
  our	
  example.	
  
59	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Architectural	
  Considera8ons	
  
Data	
  Modeling	
  –	
  Processed	
  Data	
  Storage	
  
60	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Storage	
  Formats	
  –	
  Raw	
  Data	
  and	
  Processed	
  Data	
  
Processed	
  
Data	
  
Raw	
  
	
  Data	
  
61	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Access	
  to	
  Processed	
  Data	
  
Column	
  A	
   Column	
  B	
   Column	
  C	
   Column	
  D	
  
Value	
   Value	
   Value	
   Value	
  
Value	
   Value	
   Value	
   Value	
  
Value	
   Value	
   Value	
   Value	
  
Value	
   Value	
   Value	
   Value	
  
Analy8cal	
  
Queries	
  
62	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Columnar	
  Formats	
  
•  Eliminates	
  I/O	
  for	
  columns	
  that	
  are	
  not	
  part	
  of	
  a	
  query.	
  
•  Works	
  well	
  for	
  queries	
  that	
  access	
  a	
  subset	
  of	
  columns.	
  
•  O]en	
  provide	
  be^er	
  compression.	
  
•  These	
  add	
  up	
  to	
  drama8cally	
  improved	
  performance	
  for	
  many	
  queries.	
  
1	
   2014-­‐10-­‐13	
   abc	
  
2	
   2014-­‐10-­‐14	
   def	
  
3	
   2014-­‐10-­‐15	
   ghi	
  
1	
   2	
   3	
  
2014-­‐10-­‐13	
   2014-­‐10-­‐14	
   2014-­‐10-­‐15	
  
abc	
   def	
   ghi	
  
63	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Columnar	
  Choices	
  –	
  RCFile	
  
•  Designed	
  to	
  provide	
  efficient	
  processing	
  for	
  Hive	
  queries.	
  
•  Only	
  supports	
  Java.	
  
•  No	
  Avro	
  support.	
  
•  Limited	
  compression	
  support.	
  
•  Sub-­‐op8mal	
  performance	
  compared	
  to	
  newer	
  columnar	
  formats.	
  
64	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Columnar	
  Choices	
  –	
  ORC	
  
•  A	
  be^er	
  RCFile.	
  
•  Also	
  designed	
  to	
  provide	
  efficient	
  processing	
  of	
  Hive	
  queries.	
  
•  Only	
  supports	
  Java.	
  
65	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Columnar	
  Choices	
  –	
  Parquet	
  
•  Designed	
  to	
  provide	
  efficient	
  processing	
  across	
  Hadoop	
  programming	
  interfaces	
  
–	
  MapReduce,	
  Hive,	
  Impala,	
  Pig.	
  
•  Mul8ple	
  language	
  support	
  –	
  Java,	
  C++	
  
•  Good	
  object	
  model	
  support,	
  including	
  Avro.	
  
•  Broad	
  vendor	
  support.	
  
•  These	
  features	
  make	
  Parquet	
  a	
  good	
  choice	
  for	
  our	
  processed	
  data.	
  
66	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Architectural	
  Considera8ons	
  
Data	
  Modeling	
  –	
  Schema	
  Design	
  
67	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
HDFS	
  Schema	
  Design	
  –	
  One	
  Recommenda8on	
  
/etl	
  –	
  Data	
  in	
  various	
  stages	
  of	
  ETL	
  workflow	
  
/data	
  –	
  processed	
  data	
  to	
  be	
  shared	
  data	
  with	
  the	
  en8re	
  organiza8on	
  
/tmp	
  –	
  temp	
  data	
  from	
  tools	
  or	
  shared	
  between	
  users	
  
/user/<username>	
  -­‐	
  User	
  specific	
  data,	
  jars,	
  conf	
  files	
  
/app	
  –	
  Everything	
  but	
  data:	
  UDF	
  jars,	
  HQL	
  files,	
  Oozie	
  workflows	
  
68	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Par88oning	
  
•  Split	
  the	
  dataset	
  into	
  smaller	
  consumable	
  chunks.	
  
•  Rudimentary	
  form	
  of	
  “indexing”.	
  Reduces	
  I/O	
  needed	
  to	
  process	
  queries.	
  
	
  
69	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Par88oning	
  
dataset	
  
	
  	
  	
  col=val1/file.txt	
  
	
  	
  	
  col=val2/file.txt	
  
	
  	
  	
  	
  …	
  
	
  	
  	
  col=valn/file.txt	
  
dataset	
  
	
  	
  file1.txt	
  
	
  	
  file2.txt	
  
	
  	
  	
  	
  …	
  
	
  	
  filen.txt	
  
Un-­‐par88oned	
  HDFS	
  
directory	
  structure	
  
Par88oned	
  HDFS	
  
directory	
  structure	
  
70	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Par88oning	
  considera8ons	
  
•  What	
  column	
  to	
  par88on	
  by?	
  
• Don’t	
  have	
  too	
  many	
  par88ons	
  (<10,000)	
  
• Don’t	
  have	
  too	
  many	
  small	
  files	
  in	
  the	
  par88ons	
  
• Good	
  to	
  have	
  par88on	
  sizes	
  at	
  least	
  ~1	
  GB,	
  generally	
  a	
  mul8ple	
  of	
  block	
  size.	
  
•  We’ll	
  par88on	
  by	
  6mestamp.	
  This	
  applies	
  to	
  both	
  our	
  raw	
  and	
  processed	
  data.	
  
71	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Par88oning	
  For	
  Our	
  Case	
  Study	
  
•  Raw	
  dataset:	
  
•  /etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10!
•  Processed	
  dataset:	
  
•  /data/bikeshop/clickstream/year=2014/month=10/day=10!
72	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Architectural	
  Considera8ons	
  
Data	
  Inges8on	
  
73	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
•  Omniture	
  data	
  on	
  FTP	
  
•  Apps	
  	
  
•  App	
  Logs	
  
•  RDBMS	
  
Typical	
  Clickstream	
  data	
  sources	
  
74	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Gewng	
  Files	
  from	
  FTP	
  
75	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
curl ftp://myftpsite.com/sitecatalyst/
myreport_2014-10-05.tar.gz
--user name:password | hdfs -put - /etl/clickstream/raw/
sitecatalyst/myreport_2014-10-05.tar.gz
Don’t	
  over-­‐complicate	
  things	
  
76	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Apache	
  NiFi	
  	
  
77	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Reliable,	
  distributed	
  and	
  highly	
  available	
  systems	
  
That	
  allow	
  streaming	
  events	
  to	
  Hadoop	
  
Event	
  Streaming	
  –	
  Flume	
  and	
  Kaaa	
  
78	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
•  Many	
  available	
  data	
  collec8on	
  sources	
  
•  Well	
  integrated	
  into	
  Hadoop	
  
•  Supports	
  file	
  transforma8ons	
  
•  Can	
  implement	
  complex	
  topologies	
  
•  Very	
  low	
  latency	
  
•  No	
  programming	
  required	
  
Flume:	
  
79	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
“We	
  just	
  want	
  to	
  grab	
  data	
  	
  
from	
  this	
  directory	
  	
  
and	
  write	
  it	
  to	
  HDFS”	
  
	
  
We	
  use	
  Flume	
  when:	
  
80	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
•  Very	
  high-­‐throughput	
  publish-­‐subscribe	
  messaging	
  
•  Highly	
  available	
  
•  Stores	
  data	
  and	
  can	
  replay	
  
•  Can	
  support	
  many	
  consumers	
  with	
  no	
  extra	
  latency	
  
Kaaa	
  is:	
  
81	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
“Kaaa	
  is	
  awesome.	
  	
  
We	
  heard	
  it	
  cures	
  cancer”	
  
Use	
  Kaaa	
  When:	
  
82	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
•  Use	
  Flume	
  with	
  a	
  Kaaa	
  Source	
  
•  Allows	
  to	
  get	
  data	
  from	
  Kaaa,	
  	
  
run	
  some	
  transforma8ons	
  
write	
  to	
  HDFS,	
  HBase	
  or	
  Solr	
  
Actually,	
  why	
  choose?	
  
83	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
•  We	
  want	
  to	
  ingest	
  events	
  from	
  log	
  files	
  
•  Flume’s	
  Spooling	
  Directory	
  source	
  fits	
  
•  With	
  HDFS	
  Sink	
  
•  We	
  would	
  have	
  used	
  Kaaa	
  if…	
  
• We	
  wanted	
  the	
  data	
  in	
  non-­‐Hadoop	
  systems	
  too	
  
In	
  Our	
  Example…	
  
84	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Sources	
   Interceptors	
   Selectors	
   Channels	
   Sinks	
  
Flume	
  Agent	
  
Short	
  Intro	
  to	
  Flume	
  
Twi^er,	
  logs,	
  JMS,	
  
webserver,	
  Kaaa	
  
Mask,	
  re-­‐format,	
  
validate…	
  
DR,	
  cri8cal	
   Memory,	
  file,	
  Kaaa	
   HDFS,	
  HBase,	
  Solr	
  
85	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Configura8on	
  
• Declara8ve	
  	
  
• No	
  coding	
  required.	
  
• Configura8on	
  specifies	
  how	
  
components	
  are	
  wired	
  
together.	
  
86	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Interceptors	
  
• Mask	
  fields	
  
• Validate	
  informa8on	
  
against	
  external	
  source	
  
• Extract	
  fields	
  
• Modify	
  data	
  format	
  
• Filter	
  or	
  split	
  events	
  
87	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Any	
  sufficiently	
  complex	
  configura8on	
  
Is	
  indis8nguishable	
  from	
  code	
  
88	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
A	
  Brief	
  Discussion	
  of	
  Flume	
  Pa^erns	
  –	
  Fan-­‐in	
  
• Flume	
  agent	
  runs	
  on	
  each	
  
of	
  our	
  servers.	
  
• These	
  client	
  agents	
  send	
  
data	
  to	
  mul8ple	
  agents	
  to	
  
provide	
  reliability.	
  
• Flume	
  provides	
  support	
  
for	
  load	
  balancing.	
  
89	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
A	
  Brief	
  Discussion	
  of	
  Flume	
  Pa^erns	
  –	
  Spliwng	
  
• Common	
  need	
  is	
  to	
  split	
  
data	
  on	
  ingest.	
  
• For	
  example:	
  
• Sending	
  data	
  to	
  mul8ple	
  
clusters	
  for	
  DR.	
  
• To	
  mul8ple	
  des8na8ons.	
  
• Flume	
  also	
  supports	
  
par88oning,	
  which	
  is	
  key	
  to	
  
our	
  implementa8on.	
  
90	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Flume	
  Agent	
  
	
  
	
  
	
  
	
  
Web	
  Logs	
  
Spooling	
  Dir	
  
Source	
  
Timestamp	
  
Interceptor	
  
	
  File	
  
Channel	
  
Avro	
  Sink	
  
Avro	
  Sink	
  
Avro	
  Sink	
  
Flume	
  Architecture	
  –	
  Client	
  Tier	
  
Web	
  Server	
   Flume	
  Agent	
  
91	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Flume	
  Architecture	
  –	
  Collector	
  Tier	
  
Flume	
  Agent	
  
Flume	
  Agent	
  
	
  
	
  
	
  
	
  
HDFS	
  
Avro	
  
Source	
  
File	
  
Channel	
  
HDFS	
  Sink	
  
HDFS	
  Sink	
  
92	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
•  Add	
  Kaaa	
  producer	
  to	
  our	
  webapp	
  
•  Send	
  clicks	
  and	
  searches	
  as	
  messages	
  
•  Flume	
  can	
  ingest	
  events	
  from	
  Kaaa	
  
•  We	
  can	
  add	
  a	
  second	
  consumer	
  for	
  real-­‐8me	
  
processing	
  in	
  SparkStreaming	
  
•  Another	
  consumer	
  for	
  aler8ng…	
  
•  And	
  maybe	
  a	
  batch	
  consumer	
  too	
  
	
  
What	
  if….	
  We	
  were	
  to	
  use	
  Kaaa?	
  
93	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Channels	
   Sinks	
  
Flume	
  Agent	
  
The	
  Kaaa	
  Channel	
  
Kafka HDFS,	
  HBase,	
  Solr	
  
Producer	
  
A	
  
Producer	
  
B	
  
Producer	
  
C	
  
Kaaa	
  	
  
Producers	
  
94	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Sources	
   Interceptors	
   Channels	
  
Flume	
  Agent	
  
The	
  Kaaa	
  Channel	
  
Twi^er,	
  logs,	
  JMS,	
  
webserver	
  
Mask,	
  re-­‐format,	
  
validate…	
  
Kafka
Consumer	
  A	
  
Kaaa	
  
Consumers	
  
Consumer	
  B	
  
Consumer	
  C	
  
95	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Sources	
   Interceptors	
   Selectors	
   Channels	
   Sinks	
  
Flume	
  Agent	
  
The	
  Kaaa	
  Channel	
  
Twi^er,	
  logs,	
  JMS,	
  
webserver	
  
Mask,	
  re-­‐format,	
  
validate…	
  
DR,	
  cri8cal	
   Kafka HDFS,	
  HBase,	
  Solr	
  
96	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Architectural	
  Considera8ons	
  
Data	
  Processing	
  –	
  Engines	
  
tiny.cloudera.com/app-­‐arch-­‐slides	
  
97	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Processing	
  Engines	
  
•  MapReduce	
  
•  Abstrac8ons	
  	
  
•  Spark	
  
•  Spark	
  Streaming	
  
•  Impala	
  
98	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Data	
  Processing	
  Engines	
  
MapReduce	
  and	
  Abstrac8ons	
  
99	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
MapReduce 	
  	
  
•  Oldie	
  but	
  goody	
  	
  
•  Restric8ve	
  Framework	
  /	
  Innovated	
  Work	
  Around	
  
•  Extreme	
  Batch	
  
	
  
100	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
MapReduce	
  Basic	
  High	
  Level	
  	
  
Mapper	
  
HDFS	
  	
  
(Replicated)	
  
Na8ve	
  File	
  System	
  
Block	
  of	
  Data	
  
Temp	
  Spill	
  
Data	
  
Par88oned	
  
Sorted	
  Data	
  
Reducer	
  
Reducer	
  Local	
  
Copy	
  
Output	
  File	
  
101	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
MapReduce	
  Innova8on 	
  	
  
•  Mapper	
  Memory	
  Joins	
  
•  Reducer	
  Memory	
  Joins	
  
•  Buckets	
  Sorted	
  Joins	
  
•  Cross	
  Task	
  Communica8on	
  
•  Windowing	
  
•  And	
  Much	
  More	
  
102	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Abstrac8ons	
  	
  
•  SQL	
  
• Hive	
  
•  Script/Code	
  
• Pig:	
  Pig	
  La8n	
  	
  	
  
• Crunch:	
  Java/Scala	
  
• Cascading:	
  Java/Scala	
  
103	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Data	
  Processing	
  Engines	
  
Spark	
  and	
  Spark	
  Streaming	
  
104	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Spark	
  
•  The	
  New	
  Kid	
  that	
  isn’t	
  that	
  New	
  Anymore	
  
•  Easily	
  10x	
  less	
  code	
  
•  Extremely	
  Easy	
  and	
  Powerful	
  API	
  
•  Very	
  good	
  for	
  machine	
  learning	
  
•  Scala,	
  Java,	
  and	
  Python	
  
•  RDDs	
  
•  DAG	
  Engine	
  
105	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Spark	
  -­‐	
  DAG	
  
106	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Spark	
  -­‐	
  DAG	
  
Filter	
   KeyBy	
  
KeyBy	
  
TextFile	
  
TextFile	
  
Join	
   Filter	
   Take	
  
107	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Spark	
  -­‐	
  DAG	
  
Filter	
   KeyBy	
  
KeyBy	
  
TextFile	
  
TextFile	
  
Join	
   Filter	
   Take	
  
Good	
  
Good	
  
Good	
  
Good	
  
Good	
  
Good	
  
Good-­‐Replay	
  
Good-­‐Replay	
  
Good-­‐Replay	
  
Good	
  
Good-­‐Replay	
  
Good	
  
Good-­‐Replay	
  
	
  
Lost	
  Block	
  
Replay	
  
Good-­‐Replay	
  
	
  
Lost	
  Block	
  
Good	
  
Future	
  
Future	
  
Future	
  
Future	
  
108	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Spark	
  Streaming	
  
•  Calling	
  Spark	
  in	
  a	
  Loop	
  
•  Extends	
  RDDs	
  with	
  DStream	
  
•  Very	
  Li^le	
  Code	
  Changes	
  from	
  ETL	
  to	
  Streaming	
  
109	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Spark	
  Streaming	
  
Single	
  Pass	
  
Source	
   Receiver	
   RDD	
  
Source	
   Receiver	
   RDD	
  
RDD	
  
Filter	
   Count	
   Print	
  
Source	
   Receiver	
   RDD	
  
RDD	
  
RDD	
  
Single	
  Pass	
  
Filter	
   Count	
   Print	
  
Pre-­‐first	
  	
  
Batch	
  
First	
  	
  
Batch	
  
Second	
  	
  
Batch	
  
110	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Spark	
  Streaming	
  
Single	
  Pass	
  
Source	
   Receiver	
   RDD	
  
Source	
   Receiver	
   RDD	
  
RDD	
  
Filter	
   Count	
  
Print	
  
Source	
   Receiver	
   RDD	
  
RDD	
  
RDD	
  
Single	
  Pass	
  
Filter	
   Count	
  
Pre-­‐first	
  	
  
Batch	
  
First	
  	
  
Batch	
  
Second	
  	
  
Batch	
  
Stateful	
  RDD	
  1	
  
Print	
  
Stateful	
  RDD	
  2	
  
Stateful	
  RDD	
  1	
  
111	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Data	
  Processing	
  Engines	
  
Impala	
  
112	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Impala	
  
•  MPP	
  Style	
  SQL	
  Engine	
  on	
  top	
  of	
  Hadoop	
  
•  Very	
  Fast	
  
•  High	
  Concurrency	
  
•  Analy8cal	
  windowing	
  func8ons	
  (C5.2).	
  
113	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Impala	
  –	
  Broadcast	
  Join	
  
Impala	
  Daemon	
  
Smaller	
  Table	
  
Data	
  Block	
  
100%	
  Cached	
  
Smaller	
  Table	
  
Smaller	
  Table	
  
Data	
  Block	
  
Impala	
  Daemon	
  
100%	
  Cached	
  
Smaller	
  Table	
  
Impala	
  Daemon	
  
100%	
  Cached	
  
Smaller	
  Table	
  
Impala	
  Daemon	
  
Hash	
  Join	
  Func8on	
  
Bigger	
  Table	
  Data	
  
Block	
  
100%	
  Cached	
  
Smaller	
  Table	
  
Output	
  
Impala	
  Daemon	
  
Hash	
  Join	
  Func8on	
  
Bigger	
  Table	
  Data	
  
Block	
  
100%	
  Cached	
  
Smaller	
  Table	
  
Output	
  
Impala	
  Daemon	
  
Hash	
  Join	
  Func8on	
  
Bigger	
  Table	
  Data	
  
Block	
  
100%	
  Cached	
  
Smaller	
  Table	
  
Output	
  
114	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Impala	
  –	
  Par88oned	
  Hash	
  Join	
  
Impala	
  Daemon	
  
Smaller	
  Table	
  
Data	
  Block	
  
~33%	
  Cached	
  
Smaller	
  Table	
  
Smaller	
  Table	
  
Data	
  Block	
  
Impala	
  Daemon	
  
~33%	
  Cached	
  
Smaller	
  Table	
  
Impala	
  Daemon	
  
~33%	
  Cached	
  
Smaller	
  Table	
  
Hash	
  Par88oner	
   Hash	
  Par88oner	
  
Impala	
  Daemon	
  
BiggerTable	
  Data	
  
Block	
  
Impala	
  Daemon	
   Impala	
  Daemon	
  
Hash	
  Par88oner	
  
Hash	
  Join	
  Func8on	
  
33%	
  Cached	
  
Smaller	
  Table	
  
Hash	
  Join	
  Func8on	
  
33%	
  Cached	
  
Smaller	
  Table	
  
Hash	
  Join	
  Func8on	
  
33%	
  Cached	
  
Smaller	
  Table	
  
Output	
   Output	
   Output	
  
BiggerTable	
  Data	
  
Block	
  
Hash	
  Par88oner	
  
BiggerTable	
  Data	
  
Block	
  
Hash	
  Par88oner	
  
115	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Impala	
  vs	
  Hive	
  
•  Very	
  different	
  approaches	
  and	
  	
  
•  We	
  may	
  see	
  convergence	
  at	
  some	
  point	
  
•  But	
  for	
  now	
  
• Impala	
  for	
  speed	
  
• Hive	
  for	
  batch	
  
116	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Architectural	
  Considera8ons	
  
Data	
  Processing	
  –	
  Pa^erns	
  and	
  Recommenda8ons	
  
117	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
What	
  processing	
  needs	
  to	
  happen?	
  
•  Sessioniza8on	
  
•  Filtering	
  
•  Deduplica8on	
  
•  BI	
  /	
  Discovery	
  
118	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Sessioniza8on	
  
Website	
  visit	
  
Visitor	
  1	
  
Session	
  1	
  
Visitor	
  1	
  
Session	
  2	
  
Visitor	
  2	
  
Session	
  1	
  
>	
  30	
  minutes	
  
119	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Why	
  sessionize?	
  
Helps	
  answers	
  ques8ons	
  like:	
  
•  What	
  is	
  my	
  website’s	
  bounce	
  rate?	
  
• i.e.	
  how	
  many	
  %	
  of	
  visitors	
  don’t	
  go	
  past	
  the	
  landing	
  page?	
  
•  Which	
  marke8ng	
  channels	
  (e.g.	
  organic	
  search,	
  display	
  ad,	
  etc.)	
  are	
  leading	
  to	
  
most	
  sessions?	
  
• Which	
  ones	
  of	
  those	
  lead	
  to	
  most	
  conversions	
  (e.g.	
  people	
  buying	
  things,	
  
signing	
  up,	
  etc.)	
  
•  Do	
  a^ribu8on	
  analysis	
  –	
  which	
  channels	
  are	
  responsible	
  for	
  most	
  conversions?	
  
120	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Sessioniza8on	
  
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X
10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”
244.157.45.12+1413580110
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/
1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5;
en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0
Mobile Safari/533.1” 244.157.45.12+1413583199
121	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
How	
  to	
  Sessionize?	
  
1.  Given	
  a	
  list	
  of	
  clicks,	
  determine	
  which	
  clicks	
  came	
  from	
  
the	
  same	
  user	
  (Par88oning,	
  ordering)	
  
2.  Given	
  a	
  par8cular	
  user's	
  clicks,	
  determine	
  if	
  a	
  given	
  click	
  
is	
  a	
  part	
  of	
  a	
  new	
  session	
  or	
  a	
  con8nua8on	
  of	
  the	
  
previous	
  session	
  (Iden8fying	
  session	
  boundaries)	
  
122	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
#1	
  –	
  Which	
  clicks	
  are	
  from	
  same	
  user?	
  
•  We	
  can	
  use:	
  
• IP	
  address	
  (244.157.45.12)	
  
• Cookies	
  (A9A3BECE0563982D)	
  
• IP	
  address	
  (244.157.45.12)and	
  user	
  agent	
  string	
  ((KHTML, like
Gecko) Chrome/36.0.1944.0 Safari/537.36")	
  
123	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
#1	
  –	
  Which	
  clicks	
  are	
  from	
  same	
  user?	
  
•  We	
  can	
  use:	
  
• IP	
  address	
  (244.157.45.12)	
  
• Cookies	
  (A9A3BECE0563982D)	
  
• IP	
  address	
  (244.157.45.12)and	
  user	
  agent	
  string	
  ((KHTML, like
Gecko) Chrome/36.0.1944.0 Safari/537.36")	
  
124	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
#1	
  –	
  Which	
  clicks	
  are	
  from	
  same	
  user?	
  
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0"
200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC
Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1”
125	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
#2	
  –	
  Which	
  clicks	
  	
  part	
  of	
  the	
  same	
  session?	
  
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0"
200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC
Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1”
>	
  30	
  mins	
  apart	
  =	
  different	
  sessions	
  
126	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Sessioniza8on	
  engine	
  recommenda8on	
  
•  We	
  have	
  sessioniza8on	
  code	
  in	
  MR	
  and	
  Spark	
  on	
  github.	
  The	
  complexity	
  of	
  the	
  
code	
  varies,	
  depends	
  on	
  the	
  exper8se	
  in	
  the	
  organiza8on.	
  
•  We	
  choose	
  MR	
  
• MR	
  API	
  is	
  stable	
  and	
  widely	
  known	
  
• No	
  Spark	
  +	
  Oozie	
  (orchestra8on	
  engine)	
  integra8on	
  currently	
  
127	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Filtering	
  –	
  filter	
  out	
  incomplete	
  records	
  
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0"
200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…
128	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Filtering	
  –	
  filter	
  out	
  records	
  from	
  bots/spiders	
  
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”
209.85.238.11 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0"
200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC
Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1”
Google	
  spider	
  IP	
  address	
  
129	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Filtering	
  recommenda8on	
  
•  Bot/Spider	
  filtering	
  can	
  be	
  done	
  easily	
  in	
  any	
  of	
  the	
  engines	
  
•  Incomplete	
  records	
  are	
  harder	
  to	
  filter	
  in	
  schema	
  systems	
  like	
  Hive,	
  Impala,	
  
Pig,	
  etc.	
  
•  Flume	
  interceptors	
  can	
  also	
  be	
  used	
  
•  Pre^y	
  close	
  choice	
  between	
  MR,	
  Hive	
  and	
  Spark	
  
•  Can	
  be	
  done	
  in	
  Spark	
  using	
  rdd.filter()	
  
•  We	
  can	
  simply	
  embed	
  this	
  in	
  our	
  MR	
  sessioniza8on	
  job	
  
130	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Deduplica8on	
  –	
  remove	
  duplicate	
  records	
  
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”
131	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Deduplica8on	
  recommenda8on	
  
•  Can	
  be	
  done	
  in	
  all	
  engines.	
  
•  We	
  already	
  have	
  a	
  Hive	
  table	
  with	
  all	
  the	
  columns,	
  a	
  simple	
  DISTINCT	
  query	
  
will	
  perform	
  deduplica8on	
  
•  reduce()	
  in	
  spark	
  
•  We	
  use	
  Pig	
  
132	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
BI/Discovery	
  engine	
  recommenda8on	
  
•  Main	
  requirements	
  for	
  this	
  are:	
  
• Low	
  latency	
  
• SQL	
  interface	
  (e.g.	
  JDBC/ODBC)	
  
• Users	
  don’t	
  know	
  how	
  to	
  code	
  
•  We	
  chose	
  Impala	
  
• It’s	
  a	
  SQL	
  engine	
  
• Much	
  faster	
  than	
  other	
  engines	
  
• Provides	
  standard	
  JDBC/ODBC	
  interfaces	
  
133	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
End-­‐to-­‐end	
  processing	
  
Sessioniza8on	
  Filtering	
  Deduplica8on	
  
BI	
  tools	
  
134	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Architectural	
  Considera8ons	
  
Orchestra8on	
  
135	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
•  Data	
  arrives	
  through	
  Flume	
  
•  Triggers	
  a	
  processing	
  event:	
  
• Sessionize	
  
• Enrich	
  –	
  Loca8on,	
  marke8ng	
  channel…	
  
• Store	
  as	
  Parquet	
  
•  Each	
  day	
  we	
  process	
  events	
  from	
  the	
  previous	
  day	
  
Orchestra8ng	
  Clickstream	
  
136	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
•  Workflow	
  is	
  fairly	
  simple	
  
•  Need	
  to	
  trigger	
  workflow	
  based	
  on	
  data	
  
•  Be	
  able	
  to	
  recover	
  from	
  errors	
  
•  Perhaps	
  no8fy	
  on	
  the	
  status	
  
•  And	
  collect	
  metrics	
  for	
  repor8ng	
  
Choosing	
  Right	
  
137	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Oozie	
  or	
  Azkaban?	
  
138	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Oozie	
  Architecture	
  
139	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
•  Part	
  of	
  all	
  major	
  Hadoop	
  distribu8ons	
  
•  Hue	
  integra8on	
  
•  Built	
  -­‐in	
  ac8ons	
  –	
  Hive,	
  Sqoop,	
  MapReduce,	
  SSH	
  
•  Complex	
  workflows	
  with	
  decisions	
  
•  Event	
  and	
  8me	
  based	
  scheduling	
  
•  No8fica8ons	
  
•  SLA	
  Monitoring	
  
•  REST	
  API	
  
Oozie	
  features	
  
140	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
•  Overhead	
  in	
  launching	
  jobs	
  
•  Steep	
  learning	
  curve	
  
•  XML	
  Workflows	
  
Oozie	
  Drawbacks	
  
141	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Azkaban	
  Architecture	
  
	
  
	
  
Azkaban	
  Executor	
  Server	
  
	
  
	
  
Azkaban	
  Web	
  Server	
  
HDFS	
  viewer	
  
plugin	
  
Job	
  types	
  plugin	
  
MySQL	
  
Client	
  
Hadoop	
  
142	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
•  Simplicity	
  
•  Great	
  UI	
  –	
  including	
  pluggable	
  visualizers	
  
•  Lots	
  of	
  plugins	
  –	
  Hive,	
  Pig…	
  
•  Repor8ng	
  plugin	
  
Azkaban	
  features	
  
143	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
•  Doesn’t	
  support	
  workflow	
  decisions	
  
•  Can’t	
  represent	
  data	
  dependency	
  
Azkaban	
  Limita8ons	
  
144	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
•  Workflow	
  is	
  fairly	
  simple	
  
•  Need	
  to	
  trigger	
  workflow	
  based	
  on	
  data	
  
•  Be	
  able	
  to	
  recover	
  from	
  errors	
  
•  Perhaps	
  no8fy	
  on	
  the	
  status	
  
•  And	
  collect	
  metrics	
  for	
  repor8ng	
  
Choosing…	
  
Easier	
  in	
  Oozie	
  
145	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
•  Workflow	
  is	
  fairly	
  simple	
  
•  Need	
  to	
  trigger	
  workflow	
  based	
  on	
  data	
  
•  Be	
  able	
  to	
  recover	
  from	
  errors	
  
•  Perhaps	
  no8fy	
  on	
  the	
  status	
  
•  And	
  collect	
  metrics	
  for	
  repor8ng	
  
Choosing	
  the	
  right	
  Orchestra8on	
  Tool	
  
Be^er	
  in	
  Azkaban	
  
146	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
The	
  best	
  orchestra8on	
  tool	
  
is	
  the	
  one	
  you	
  are	
  an	
  expert	
  on	
  
Important	
  Decision	
  Considera8on!	
  
147	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Orchestra8on	
  Pa^erns	
  –	
  Fan	
  Out	
  
148	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Capture	
  &	
  Decide	
  Pa^ern	
  
149	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Puwng	
  It	
  All	
  Together	
  
Final	
  Architecture	
  
150	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Final	
  Architecture	
  –	
  High	
  Level	
  Overview	
  
Data	
  
Sources	
  
Inges8on	
  
Raw	
  Data	
  
Storage	
  
(Formats,	
  
Schema)	
  
Processed	
  
Data	
  Storage	
  
(Formats,	
  
Schema)	
  
Processing	
  
Data	
  
Consump8on	
  
Orchestra8on	
  
(Scheduling,	
  
Managing,	
  
Monitoring)	
  
151	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Final	
  Architecture	
  –	
  High	
  Level	
  Overview	
  
Data	
  
Sources	
  
Inges8on	
  
Raw	
  Data	
  
Storage	
  
(Formats,	
  
Schema)	
  
Processed	
  
Data	
  Storage	
  
(Formats,	
  
Schema)	
  
Processing	
  
Data	
  
Consump8on	
  
Orchestra8on	
  
(Scheduling,	
  
Managing,	
  
Monitoring)	
  
152	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Final	
  Architecture	
  –	
  Inges8on/Storage	
  
Web	
  Server	
   Flume	
  Agent	
  
Web	
  Server	
   Flume	
  Agent	
  
Web	
  Server	
   Flume	
  Agent	
  
Web	
  Server	
   Flume	
  Agent	
  
Web	
  Server	
   Flume	
  Agent	
  
Web	
  Server	
   Flume	
  Agent	
  
Web	
  Server	
   Flume	
  Agent	
  
Web	
  Server	
   Flume	
  Agent	
  
Flume	
  Agent	
  
Flume	
  Agent	
  
Flume	
  Agent	
  
Flume	
  Agent	
  
Fan-­‐in	
  	
  
Pa^ern	
  
Mul8	
  Agents	
  for	
  	
  
Failover	
  and	
  rolling	
  restarts	
  
HDFS	
  	
  
/etl/BI/casualcyclist/
clicks/rawlogs/
year=2014/month=10/
day=10!
153	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Final	
  Architecture	
  –	
  High	
  Level	
  Overview	
  
Data	
  
Sources	
  
Inges8on	
  
Raw	
  Data	
  
Storage	
  
(Formats,	
  
Schema)	
  
Processed	
  
Data	
  Storage	
  
(Formats,	
  
Schema)	
  
Processing	
  
Data	
  
Consump8on	
  
Orchestra8on	
  
(Scheduling,	
  
Managing,	
  
Monitoring)	
  
154	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Final	
  Architecture	
  –	
  Processing	
  and	
  Storage	
  
/etl/BI/
casualcyclist/clicks/
rawlogs/year=2014/
month=10/day=10	
  
…	
  
dedup-­‐>filtering-­‐
>sessioniza8on	
  
/data/bikeshop/
clickstream/
year=2014/
month=10/day=10	
  
…	
  
parque8ze	
  
155	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Final	
  Architecture	
  –	
  High	
  Level	
  Overview	
  
Data	
  
Sources	
  
Inges8on	
  
Raw	
  Data	
  
Storage	
  
(Formats,	
  
Schema)	
  
Processed	
  
Data	
  Storage	
  
(Formats,	
  
Schema)	
  
Processing	
  
Data	
  
Consump8on	
  
Orchestra8on	
  
(Scheduling,	
  
Managing,	
  
Monitoring)	
  
156	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Final	
  Architecture	
  –	
  Data	
  Access	
  
Hive/
Impala	
  
BI/Analy8cs	
  
Tools	
  
DWH	
  
Sqoop	
  
Local	
  Disk	
   R,	
  etc.	
  
DB	
  import	
  tool	
  
JDBC/ODBC	
  
157	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Demo	
  
158	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Join	
  the	
  Discussion	
  
	
  	
  
Get	
  community	
  
help	
  or	
  provide	
  
feedback	
  
	
  
cloudera.com/community	
  
159	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  Try	
  Cloudera	
  Live	
  today—cloudera.com/live	
  
160	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Thank	
  you	
  
hadooparchitecturebook.com	
  
8ny.cloudera.com/app-­‐arch-­‐slides	
  
	
  

More Related Content

What's hot

Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoophadooparchbook
 
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an exampleArchitecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an examplehadooparchbook
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detectionhadooparchbook
 
Intro to hadoop tutorial
Intro to hadoop tutorialIntro to hadoop tutorial
Intro to hadoop tutorialmarkgrover
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detectionhadooparchbook
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platformhadooparchbook
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patternshadooparchbook
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014hadooparchbook
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialhadooparchbook
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationshadooparchbook
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming apphadooparchbook
 
Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platformhadooparchbook
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialhadooparchbook
 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopDataWorks Summit
 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoopmarkgrover
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Jonathan Seidman
 
Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016StampedeCon
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 

What's hot (20)

Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
 
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an exampleArchitecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an example
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
 
Intro to hadoop tutorial
Intro to hadoop tutorialIntro to hadoop tutorial
Intro to hadoop tutorial
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detection
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patterns
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
 
Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platform
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with Hadoop
 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoop
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
 
Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 

Viewers also liked

Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an examplehadooparchbook
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platformhadooparchbook
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applicationshadooparchbook
 
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...Remy Rosenbaum
 
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational DatabasesReal-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational DatabasesSingleStore
 
How to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin LeauHow to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin LeauCodemotion
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingJack Gudenkauf
 
Patricia Yurena: la 1ª Miss España abiertamente Lesbiana
Patricia Yurena: la 1ª Miss España abiertamente LesbianaPatricia Yurena: la 1ª Miss España abiertamente Lesbiana
Patricia Yurena: la 1ª Miss España abiertamente LesbianaValkyrja5
 
Portafolio alvaro romero
Portafolio alvaro romeroPortafolio alvaro romero
Portafolio alvaro romeroalvjor
 
All Events in City - VIBRANT START-UP CHALLENGE
All Events in City - VIBRANT START-UP CHALLENGEAll Events in City - VIBRANT START-UP CHALLENGE
All Events in City - VIBRANT START-UP CHALLENGEAmit Panchal
 
Instructions DOCTER®sight Red Dot | Optics Trade
Instructions DOCTER®sight Red Dot | Optics TradeInstructions DOCTER®sight Red Dot | Optics Trade
Instructions DOCTER®sight Red Dot | Optics TradeOptics-Trade
 
Perspectiva Fotográfica: Marlins vs. Phillies (23 de Septiembre de 2014)
Perspectiva Fotográfica: Marlins vs. Phillies (23 de Septiembre de 2014)Perspectiva Fotográfica: Marlins vs. Phillies (23 de Septiembre de 2014)
Perspectiva Fotográfica: Marlins vs. Phillies (23 de Septiembre de 2014)Rodrigo Lema González
 

Viewers also liked (14)

Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platform
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applications
 
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
 
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational DatabasesReal-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
 
How to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin LeauHow to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin Leau
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
 
Patricia Yurena: la 1ª Miss España abiertamente Lesbiana
Patricia Yurena: la 1ª Miss España abiertamente LesbianaPatricia Yurena: la 1ª Miss España abiertamente Lesbiana
Patricia Yurena: la 1ª Miss España abiertamente Lesbiana
 
Portafolio alvaro romero
Portafolio alvaro romeroPortafolio alvaro romero
Portafolio alvaro romero
 
All Events in City - VIBRANT START-UP CHALLENGE
All Events in City - VIBRANT START-UP CHALLENGEAll Events in City - VIBRANT START-UP CHALLENGE
All Events in City - VIBRANT START-UP CHALLENGE
 
Instructions DOCTER®sight Red Dot | Optics Trade
Instructions DOCTER®sight Red Dot | Optics TradeInstructions DOCTER®sight Red Dot | Optics Trade
Instructions DOCTER®sight Red Dot | Optics Trade
 
Ensayo
EnsayoEnsayo
Ensayo
 
Perspectiva Fotográfica: Marlins vs. Phillies (23 de Septiembre de 2014)
Perspectiva Fotográfica: Marlins vs. Phillies (23 de Septiembre de 2014)Perspectiva Fotográfica: Marlins vs. Phillies (23 de Septiembre de 2014)
Perspectiva Fotográfica: Marlins vs. Phillies (23 de Septiembre de 2014)
 

Similar to Hadoop Application Architectures tutorial at Big DataService 2015

Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesCloudera, Inc.
 
Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Cloudera, Inc.
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduCloudera, Inc.
 
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Data Con LA
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerUnlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerCloudera, Inc.
 
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...Cloudera, Inc.
 
Enterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaEnterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaNeo4j
 
Unlocking Big Data Insights with MySQL
Unlocking Big Data Insights with MySQLUnlocking Big Data Insights with MySQL
Unlocking Big Data Insights with MySQLMatt Lord
 
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data PipelinesPutting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data PipelinesDATAVERSITY
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Uri Laserson
 
Big Data Fundamentals 6.6.18
Big Data Fundamentals 6.6.18Big Data Fundamentals 6.6.18
Big Data Fundamentals 6.6.18Cloudera, Inc.
 
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Stefan Lipp
 
Big Data with KNIME is as easy as 1, 2, 3, ...4!
Big Data with KNIME is as easy as 1, 2, 3, ...4!Big Data with KNIME is as easy as 1, 2, 3, ...4!
Big Data with KNIME is as easy as 1, 2, 3, ...4!KNIMESlides
 
Big Data as easy as 1, 2, 3, ... 4 ... with KNIME
Big Data as easy as 1, 2, 3, ... 4 ... with KNIMEBig Data as easy as 1, 2, 3, ... 4 ... with KNIME
Big Data as easy as 1, 2, 3, ... 4 ... with KNIMERosaria Silipo
 
Hadoop and Manufacturing
Hadoop and ManufacturingHadoop and Manufacturing
Hadoop and ManufacturingCloudera, Inc.
 
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...DataStax Academy
 
Big data journey to the cloud 5.30.18 asher bartch
Big data journey to the cloud 5.30.18   asher bartchBig data journey to the cloud 5.30.18   asher bartch
Big data journey to the cloud 5.30.18 asher bartchCloudera, Inc.
 

Similar to Hadoop Application Architectures tutorial at Big DataService 2015 (20)

Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
 
Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015Application Architectures with Hadoop | Data Day Texas 2015
Application Architectures with Hadoop | Data Day Texas 2015
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache Kudu
 
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerUnlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator Optimizer
 
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
 
Enterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaEnterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, Cloudera
 
Unlocking Big Data Insights with MySQL
Unlocking Big Data Insights with MySQLUnlocking Big Data Insights with MySQL
Unlocking Big Data Insights with MySQL
 
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data PipelinesPutting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
Putting the Ops in DataOps: Orchestrate the Flow of Data Across Data Pipelines
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
 
Big Data Fundamentals 6.6.18
Big Data Fundamentals 6.6.18Big Data Fundamentals 6.6.18
Big Data Fundamentals 6.6.18
 
Big Data Fundamentals
Big Data FundamentalsBig Data Fundamentals
Big Data Fundamentals
 
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
 
Big Data with KNIME is as easy as 1, 2, 3, ...4!
Big Data with KNIME is as easy as 1, 2, 3, ...4!Big Data with KNIME is as easy as 1, 2, 3, ...4!
Big Data with KNIME is as easy as 1, 2, 3, ...4!
 
Big Data as easy as 1, 2, 3, ... 4 ... with KNIME
Big Data as easy as 1, 2, 3, ... 4 ... with KNIMEBig Data as easy as 1, 2, 3, ... 4 ... with KNIME
Big Data as easy as 1, 2, 3, ... 4 ... with KNIME
 
Hadoop and Manufacturing
Hadoop and ManufacturingHadoop and Manufacturing
Hadoop and Manufacturing
 
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
 
Big data journey to the cloud 5.30.18 asher bartch
Big data journey to the cloud 5.30.18   asher bartchBig data journey to the cloud 5.30.18   asher bartch
Big data journey to the cloud 5.30.18 asher bartch
 

Recently uploaded

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 

Recently uploaded (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 

Hadoop Application Architectures tutorial at Big DataService 2015

  • 1. 1  ©  Cloudera,  Inc.  All  rights  reserved.   Architectural  Considera8ons   for  Hadoop  Applica8ons   Mark  Grover  |  @mark_grover   Jonathan  Seidman  |  @jseidman     Gwen  Shapira  |  @gwenshap   IEEE  BigDataService  2015  |  March  30th,  2015   8ny.cloudera.com/app-­‐arch-­‐slides  
  • 2. 2  ©  Cloudera,  Inc.  All  rights  reserved.   About  the  book   •  @hadooparchbook   •  hadooparchitecturebook.com   •  github.com/hadooparchitecturebook   •  slideshare.com/hadooparchbook  
  • 3. 3  ©  Cloudera,  Inc.  All  rights  reserved.   About  the  presenters   • Principal  Solu8ons  Architect   at  Cloudera   • Previously,  lead  architect  at   FINRA   • Contributor  to  Apache   Hadoop,  HBase,  Flume,   Avro,  Pig  and  Spark   • Senior  Solu8ons  Architect/ Partner  Enablement  at   Cloudera   • Previously,  Technical  Lead   on  the  big  data  team  at   Orbitz  Worldwide   • Co-­‐founder  of  the  Chicago   Hadoop  User  Group  and   Chicago  Big  Data   Ted  Malaska   Jonathan  Seidman  
  • 4. 4  ©  Cloudera,  Inc.  All  rights  reserved.   About  the  presenters   • Solu8ons  Architect  turned   So]ware  Engineer  at   Cloudera   • Commi^er  on  Apache   Sqoop   • Contributor  to  Apache   Flume  and  Apache  Kaaa   • So]ware  Engineer  at   Cloudera   • Commi^er  on  Apache   Bigtop,  PMC  member  on   Apache  Sentry  (incuba8ng)   • Contributor  to  Apache   Hadoop,  Spark,  Hive,  Sqoop,   Pig  and  Flume   Gwen  Shapira   Mark  Grover  
  • 5. 5  ©  Cloudera,  Inc.  All  rights  reserved.   Logis8cs   •  Break  at  10:30-­‐11:00  PM   •  Ques8ons  at  the  end  of  each  sec8on  
  • 6. 6  ©  Cloudera,  Inc.  All  rights  reserved.   Case  Study   Clickstream  Analysis  
  • 7. 7  ©  Cloudera,  Inc.  All  rights  reserved.   Analy8cs  
  • 8. 8  ©  Cloudera,  Inc.  All  rights  reserved.   Analy8cs  
  • 9. 9  ©  Cloudera,  Inc.  All  rights  reserved.   Analy8cs  
  • 10. 10  ©  Cloudera,  Inc.  All  rights  reserved.   Analy8cs  
  • 11. 11  ©  Cloudera,  Inc.  All  rights  reserved.   Analy8cs  
  • 12. 12  ©  Cloudera,  Inc.  All  rights  reserved.   Analy8cs  
  • 13. 13  ©  Cloudera,  Inc.  All  rights  reserved.   Analy8cs  
  • 14. 14  ©  Cloudera,  Inc.  All  rights  reserved.   Web  Logs  –  Combined  Log  Format   244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/ 5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp? productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/ GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1”
  • 15. 15  ©  Cloudera,  Inc.  All  rights  reserved.   Clickstream  Analy8cs   244.157.45.12 - - [17/Oct/ 2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/ top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/ 537.36”
  • 16. 16  ©  Cloudera,  Inc.  All  rights  reserved.   Similar  use-­‐cases   •  Sensors  –  heart,  agriculture,  etc.   •  Casinos  –  session  of  a  person  at  a  table  
  • 17. 17  ©  Cloudera,  Inc.  All  rights  reserved.   Pre-­‐Hadoop  Architecture   Clickstream  Analysis  
  • 18. 18  ©  Cloudera,  Inc.  All  rights  reserved.   Click  Stream  Analysis  (Before  Hadoop)   Web  logs   (full  fidelity)   (2  weeks)   Data  Warehouse   Transform/ Aggregate   Business   Intelligence   Tape  Archive  
  • 19. 19  ©  Cloudera,  Inc.  All  rights  reserved.   Problems  with  Pre-­‐Hadoop  Architecture   •  Full  fidelity  data  is  stored  for  small  amount  of  8me  (~weeks).   •  Older  data  is  sent  to  tape,  or  even  worse,  deleted!   •  Inflexible  workflow  -­‐  think  of  all  aggrega8ons  beforehand  
  • 20. 20  ©  Cloudera,  Inc.  All  rights  reserved.   Effects  of  Pre-­‐Hadoop  Architecture   •  Regenera8ng  aggregates  is  expensive  or  worse,  impossible   •  Can’t  correct  bugs  in  the  workflow/aggrega8on  logic   •  Can’t  do  experiments  on  exis8ng  data  
  • 21. 21  ©  Cloudera,  Inc.  All  rights  reserved.   Why  is  Hadoop  A  Great  Fit?   Clickstream  Analysis  
  • 22. 22  ©  Cloudera,  Inc.  All  rights  reserved.   Why  is  Hadoop  a  great  fit?     •  Volume  of  clickstream  data  is  huge   •  Velocity  at  which  it  comes  in  is  high   •  Variety  of  data  is  diverse  -­‐  semi-­‐structured  data   •  Hadoop  enables   • ac8ve  archival  of  data   • Aggrega8on  jobs   • Querying  the  above  aggregates  or  raw  fidelity  data  
  • 23. 23  ©  Cloudera,  Inc.  All  rights  reserved.   Click  Stream  Analysis  (with  Hadoop)   Web  logs   Hadoop   Business   Intelligence   Ac8ve  archive  (no  tape)   Aggrega8on  engine   Querying  engine  
  • 24. 24  ©  Cloudera,  Inc.  All  rights  reserved.   Challenges  of  Hadoop  Implementa8on  
  • 25. 25  ©  Cloudera,  Inc.  All  rights  reserved.   Challenges  of  Hadoop  Implementa8on  
  • 26. 26  ©  Cloudera,  Inc.  All  rights  reserved.   Other  challenges  -­‐  Architectural  Considera8ons     •  Storage  managers?   •  HDFS?  HBase?   •  Data  storage  and  modeling:   •  File  formats?  Compression?  Schema  design?   •  Data  movement   •  How  do  we  actually  get  the  data  into  Hadoop?  How  do  we  get  it  out?   •  Metadata   •  How  do  we  manage  data  about  the  data?   •  Data  access  and  processing   •  How  will  the  data  be  accessed  once  in  Hadoop?  How  can  we  transform  it?  How  do  we  query   it?   •  Orchestra8on   •  How  do  we  manage  the  workflow  for  all  of  this?  
  • 27. 27  ©  Cloudera,  Inc.  All  rights  reserved.   Case  Study  Requirements   Overview  of  Requirements  
  • 28. 28  ©  Cloudera,  Inc.  All  rights  reserved.   Overview  of  Requirements   Data   Sources   Inges8on   Raw  Data   Storage   (Formats,   Schema)   Processed   Data  Storage   (Formats,   Schema)   Processing   Data   Consump8on   Orchestra8on   (Scheduling,   Managing,   Monitoring)  
  • 29. 29  ©  Cloudera,  Inc.  All  rights  reserved.   Case  Study  Requirements   Data  Inges8on  
  • 30. 30  ©  Cloudera,  Inc.  All  rights  reserved.   Data  Inges8on  Requirements   Web  Servers   Web  Servers   Web  Servers   Web  Servers   Web  Servers   Logs     244.157.45.12 - - [17/ Oct/2014:21:08:30 ] "GET /seatposts HTTP/ 1.0" 200 4463 …   CRM  Data   ODS   Web  Servers   Web  Servers   Web  Servers   Web  Servers   Web  Servers   Logs     Hadoop  
  • 31. 31  ©  Cloudera,  Inc.  All  rights  reserved.   Data  Inges8on  Requirements   •  So  we  need  to  be  able  to  support:   • Reliable  inges8on  of  large  volumes  of  semi-­‐structured  event  data  arriving  with   high  velocity  (e.g.  logs).   • Timeliness  of  data  availability  –  data  needs  to  be  available  for  processing  to   meet  business  service  level  agreements.   • Periodic  inges8on  of  data  from  rela8onal  data  stores.  
  • 32. 32  ©  Cloudera,  Inc.  All  rights  reserved.   Case  Study  Requirements   Data  Storage  
  • 33. 33  ©  Cloudera,  Inc.  All  rights  reserved.   Data  Storage  Requirements   Store  all  the  data   Make  the  data   accessible  for   processing   Compress  the   data  
  • 34. 34  ©  Cloudera,  Inc.  All  rights  reserved.   Case  Study  Requirements   Data  Processing  
  • 35. 35  ©  Cloudera,  Inc.  All  rights  reserved.   Processing  requirements   Be  able  to  answer  ques8ons  like:   •  What  is  my  website’s  bounce  rate?   • i.e.  how  many  %  of  visitors  don’t  go  past  the  landing  page?   •  Which  marke8ng  channels  are  leading  to  most  sessions?   •  Do  a^ribu8on  analysis   • Which  channels  are  responsible  for  most  conversions?  
  • 36. 36  ©  Cloudera,  Inc.  All  rights  reserved.   Sessioniza8on   Website  visit   Visitor  1   Session  1   Visitor  1   Session  2   Visitor  2   Session  1   >  30  minutes  
  • 37. 37  ©  Cloudera,  Inc.  All  rights  reserved.   Case  Study  Requirements   Orchestra8on  
  • 38. 38  ©  Cloudera,  Inc.  All  rights  reserved.   Orchestra8on  is  simple   We  just  need  to  execute  ac8ons   One  a]er  another  
  • 39. 39  ©  Cloudera,  Inc.  All  rights  reserved.   Actually,     we  also  need  to  handle  errors   And  user  no8fica8ons   ….    
  • 40. 40  ©  Cloudera,  Inc.  All  rights  reserved.   And…   •  Re-­‐start  workflows  a]er  errors   •  Reuse  of  ac8ons  in  mul8ple  workflows   •  Complex  workflows  with  decision  points   •  Trigger  ac8ons  based  on  events   •  Tracking  metadata     •  Integra8on  with  enterprise  so]ware   •  Data  lifecycle   •  Data  quality  control   •  Reports  
  • 41. 41  ©  Cloudera,  Inc.  All  rights  reserved.   OK,  maybe  we  need  a  product   To  help  us  do  all  that    
  • 42. 42  ©  Cloudera,  Inc.  All  rights  reserved.   Architectural  Considera8ons   Data  Modeling  
  • 43. 43  ©  Cloudera,  Inc.  All  rights  reserved.   Data  Modeling  Considera8ons   •  We  need  to  consider  the  following  in  our  architecture:   • Storage  layer  –  HDFS?  HBase?  Etc.   • File  system  schemas  –  how  will  we  lay  out  the  data?   • File  formats  –  what  storage  formats  to  use  for  our  data,  both  raw  and   processed  data?   • Data  compression  formats?  
  • 44. 44  ©  Cloudera,  Inc.  All  rights  reserved.   Architectural  Considera8ons   Data  Modeling  –  Storage  Layer  
  • 45. 45  ©  Cloudera,  Inc.  All  rights  reserved.   Data  Storage  Layer  Choices   •  Two  likely  choices  for  raw  data:  
  • 46. 46  ©  Cloudera,  Inc.  All  rights  reserved.   Data  Storage  Layer  Choices   • Stores  data  directly  as  files   • Fast  scans   • Poor  random  reads/writes   • Stores  data  as  Hfiles  on  HDFS   • Slow  scans   • Fast  random  reads/writes  
  • 47. 47  ©  Cloudera,  Inc.  All  rights  reserved.   Data  Storage  –  Storage  Manager  Considera8ons   •  Incoming  raw  data:   • Processing  requirements  call  for  batch  transforma8ons  across  mul8ple  records   –  for  example  sessioniza8on.   •  Processed  data:   • Access  to  processed  data  will  be  via  things  like  analy8cal  queries  –  again   requiring  access  to  mul8ple  records.   •  We  choose  HDFS   • Processing  needs  in  this  case  served  be^er  by  fast  scans.  
  • 48. 48  ©  Cloudera,  Inc.  All  rights  reserved.   Architectural  Considera8ons   Data  Modeling  –  Raw  Data  Storage  
  • 49. 49  ©  Cloudera,  Inc.  All  rights  reserved.   Storage  Formats  –  Raw  Data  and  Processed  Data   Processed   Data   Raw    Data  
  • 50. 50  ©  Cloudera,  Inc.  All  rights  reserved.   Data  Storage  –  Format  Considera8ons     Logs   (plain   text)  
  • 51. 51  ©  Cloudera,  Inc.  All  rights  reserved.   Data  Storage  –  Format  Considera8ons     Logs   (plain   text)   Logs   (plain   text)   Logs   (plain   text)   Logs   (plain   text)   Logs   (plain   text)   Logs   (plain   text)   Logs   Logs   Logs   Logs   Logs  
  • 52. 52  ©  Cloudera,  Inc.  All  rights  reserved.   Data  Storage  –  Compression   snappy Well,  maybe.   But  not  spli^able.   X Spli^able.  Gewng     be^er…   Hmmm….  Spli^able,  but  no...  
  • 53. 53  ©  Cloudera,  Inc.  All  rights  reserved.   Raw  Data  Storage  –  More  About  Snappy   • Designed  at  Google  to  provide  high  compression  speeds  with  reasonable   compression.   • Not  the  highest  compression,  but  provides  very  good  performance  for   processing  on  Hadoop.   • Snappy  is  not  spli^able  though,  which  brings  us  to…  
  • 54. 54  ©  Cloudera,  Inc.  All  rights  reserved.   Hadoop  File  Types   •  Formats  designed  specifically  to  store  and  process  data  on  Hadoop:   • File  based  –  SequenceFile   • Serializa8on  formats  –  Thri],  Protocol  Buffers,  Avro   • Columnar  formats  –  RCFile,  ORC,  Parquet  
  • 55. 55  ©  Cloudera,  Inc.  All  rights  reserved.   SequenceFile   • Stores  records  as  binary   key/value  pairs.   • SequenceFile  “blocks”  can   be  compressed.   • This  enables  spli^ability   with  non-­‐spli^able   compression.      
  • 56. 56  ©  Cloudera,  Inc.  All  rights  reserved.   Avro   • Kinda  SequenceFile  on   Steroids.   • Self-­‐documen8ng  –  stores   schema  in  header.   • Provides  very  efficient   storage.   • Supports  spli^able   compression.  
  • 57. 57  ©  Cloudera,  Inc.  All  rights  reserved.   Our  Format  Recommenda8ons  for  Raw  Data…   •  Avro  with  Snappy   • Snappy  provides  op8mized  compression.   • Avro  provides  compact  storage,  self-­‐documen8ng  files,  and  supports  schema   evolu8on.   • Avro  also  provides  be^er  failure  handling  than  other  choices.   •  SequenceFiles  would  also  be  a  good  choice,  and  are  directly  supported  by   inges8on  tools  in  the  ecosystem.   • But  only  supports  Java.  
  • 58. 58  ©  Cloudera,  Inc.  All  rights  reserved.   But  Note…   •  For  simplicity,  we’ll  use  plain  text  for  raw  data  in  our  example.  
  • 59. 59  ©  Cloudera,  Inc.  All  rights  reserved.   Architectural  Considera8ons   Data  Modeling  –  Processed  Data  Storage  
  • 60. 60  ©  Cloudera,  Inc.  All  rights  reserved.   Storage  Formats  –  Raw  Data  and  Processed  Data   Processed   Data   Raw    Data  
  • 61. 61  ©  Cloudera,  Inc.  All  rights  reserved.   Access  to  Processed  Data   Column  A   Column  B   Column  C   Column  D   Value   Value   Value   Value   Value   Value   Value   Value   Value   Value   Value   Value   Value   Value   Value   Value   Analy8cal   Queries  
  • 62. 62  ©  Cloudera,  Inc.  All  rights  reserved.   Columnar  Formats   •  Eliminates  I/O  for  columns  that  are  not  part  of  a  query.   •  Works  well  for  queries  that  access  a  subset  of  columns.   •  O]en  provide  be^er  compression.   •  These  add  up  to  drama8cally  improved  performance  for  many  queries.   1   2014-­‐10-­‐13   abc   2   2014-­‐10-­‐14   def   3   2014-­‐10-­‐15   ghi   1   2   3   2014-­‐10-­‐13   2014-­‐10-­‐14   2014-­‐10-­‐15   abc   def   ghi  
  • 63. 63  ©  Cloudera,  Inc.  All  rights  reserved.   Columnar  Choices  –  RCFile   •  Designed  to  provide  efficient  processing  for  Hive  queries.   •  Only  supports  Java.   •  No  Avro  support.   •  Limited  compression  support.   •  Sub-­‐op8mal  performance  compared  to  newer  columnar  formats.  
  • 64. 64  ©  Cloudera,  Inc.  All  rights  reserved.   Columnar  Choices  –  ORC   •  A  be^er  RCFile.   •  Also  designed  to  provide  efficient  processing  of  Hive  queries.   •  Only  supports  Java.  
  • 65. 65  ©  Cloudera,  Inc.  All  rights  reserved.   Columnar  Choices  –  Parquet   •  Designed  to  provide  efficient  processing  across  Hadoop  programming  interfaces   –  MapReduce,  Hive,  Impala,  Pig.   •  Mul8ple  language  support  –  Java,  C++   •  Good  object  model  support,  including  Avro.   •  Broad  vendor  support.   •  These  features  make  Parquet  a  good  choice  for  our  processed  data.  
  • 66. 66  ©  Cloudera,  Inc.  All  rights  reserved.   Architectural  Considera8ons   Data  Modeling  –  Schema  Design  
  • 67. 67  ©  Cloudera,  Inc.  All  rights  reserved.   HDFS  Schema  Design  –  One  Recommenda8on   /etl  –  Data  in  various  stages  of  ETL  workflow   /data  –  processed  data  to  be  shared  data  with  the  en8re  organiza8on   /tmp  –  temp  data  from  tools  or  shared  between  users   /user/<username>  -­‐  User  specific  data,  jars,  conf  files   /app  –  Everything  but  data:  UDF  jars,  HQL  files,  Oozie  workflows  
  • 68. 68  ©  Cloudera,  Inc.  All  rights  reserved.   Par88oning   •  Split  the  dataset  into  smaller  consumable  chunks.   •  Rudimentary  form  of  “indexing”.  Reduces  I/O  needed  to  process  queries.    
  • 69. 69  ©  Cloudera,  Inc.  All  rights  reserved.   Par88oning   dataset        col=val1/file.txt        col=val2/file.txt          …        col=valn/file.txt   dataset      file1.txt      file2.txt          …      filen.txt   Un-­‐par88oned  HDFS   directory  structure   Par88oned  HDFS   directory  structure  
  • 70. 70  ©  Cloudera,  Inc.  All  rights  reserved.   Par88oning  considera8ons   •  What  column  to  par88on  by?   • Don’t  have  too  many  par88ons  (<10,000)   • Don’t  have  too  many  small  files  in  the  par88ons   • Good  to  have  par88on  sizes  at  least  ~1  GB,  generally  a  mul8ple  of  block  size.   •  We’ll  par88on  by  6mestamp.  This  applies  to  both  our  raw  and  processed  data.  
  • 71. 71  ©  Cloudera,  Inc.  All  rights  reserved.   Par88oning  For  Our  Case  Study   •  Raw  dataset:   •  /etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10! •  Processed  dataset:   •  /data/bikeshop/clickstream/year=2014/month=10/day=10!
  • 72. 72  ©  Cloudera,  Inc.  All  rights  reserved.   Architectural  Considera8ons   Data  Inges8on  
  • 73. 73  ©  Cloudera,  Inc.  All  rights  reserved.   •  Omniture  data  on  FTP   •  Apps     •  App  Logs   •  RDBMS   Typical  Clickstream  data  sources  
  • 74. 74  ©  Cloudera,  Inc.  All  rights  reserved.   Gewng  Files  from  FTP  
  • 75. 75  ©  Cloudera,  Inc.  All  rights  reserved.   curl ftp://myftpsite.com/sitecatalyst/ myreport_2014-10-05.tar.gz --user name:password | hdfs -put - /etl/clickstream/raw/ sitecatalyst/myreport_2014-10-05.tar.gz Don’t  over-­‐complicate  things  
  • 76. 76  ©  Cloudera,  Inc.  All  rights  reserved.   Apache  NiFi    
  • 77. 77  ©  Cloudera,  Inc.  All  rights  reserved.   Reliable,  distributed  and  highly  available  systems   That  allow  streaming  events  to  Hadoop   Event  Streaming  –  Flume  and  Kaaa  
  • 78. 78  ©  Cloudera,  Inc.  All  rights  reserved.   •  Many  available  data  collec8on  sources   •  Well  integrated  into  Hadoop   •  Supports  file  transforma8ons   •  Can  implement  complex  topologies   •  Very  low  latency   •  No  programming  required   Flume:  
  • 79. 79  ©  Cloudera,  Inc.  All  rights  reserved.   “We  just  want  to  grab  data     from  this  directory     and  write  it  to  HDFS”     We  use  Flume  when:  
  • 80. 80  ©  Cloudera,  Inc.  All  rights  reserved.   •  Very  high-­‐throughput  publish-­‐subscribe  messaging   •  Highly  available   •  Stores  data  and  can  replay   •  Can  support  many  consumers  with  no  extra  latency   Kaaa  is:  
  • 81. 81  ©  Cloudera,  Inc.  All  rights  reserved.   “Kaaa  is  awesome.     We  heard  it  cures  cancer”   Use  Kaaa  When:  
  • 82. 82  ©  Cloudera,  Inc.  All  rights  reserved.   •  Use  Flume  with  a  Kaaa  Source   •  Allows  to  get  data  from  Kaaa,     run  some  transforma8ons   write  to  HDFS,  HBase  or  Solr   Actually,  why  choose?  
  • 83. 83  ©  Cloudera,  Inc.  All  rights  reserved.   •  We  want  to  ingest  events  from  log  files   •  Flume’s  Spooling  Directory  source  fits   •  With  HDFS  Sink   •  We  would  have  used  Kaaa  if…   • We  wanted  the  data  in  non-­‐Hadoop  systems  too   In  Our  Example…  
  • 84. 84  ©  Cloudera,  Inc.  All  rights  reserved.   Sources   Interceptors   Selectors   Channels   Sinks   Flume  Agent   Short  Intro  to  Flume   Twi^er,  logs,  JMS,   webserver,  Kaaa   Mask,  re-­‐format,   validate…   DR,  cri8cal   Memory,  file,  Kaaa   HDFS,  HBase,  Solr  
  • 85. 85  ©  Cloudera,  Inc.  All  rights  reserved.   Configura8on   • Declara8ve     • No  coding  required.   • Configura8on  specifies  how   components  are  wired   together.  
  • 86. 86  ©  Cloudera,  Inc.  All  rights  reserved.   Interceptors   • Mask  fields   • Validate  informa8on   against  external  source   • Extract  fields   • Modify  data  format   • Filter  or  split  events  
  • 87. 87  ©  Cloudera,  Inc.  All  rights  reserved.   Any  sufficiently  complex  configura8on   Is  indis8nguishable  from  code  
  • 88. 88  ©  Cloudera,  Inc.  All  rights  reserved.   A  Brief  Discussion  of  Flume  Pa^erns  –  Fan-­‐in   • Flume  agent  runs  on  each   of  our  servers.   • These  client  agents  send   data  to  mul8ple  agents  to   provide  reliability.   • Flume  provides  support   for  load  balancing.  
  • 89. 89  ©  Cloudera,  Inc.  All  rights  reserved.   A  Brief  Discussion  of  Flume  Pa^erns  –  Spliwng   • Common  need  is  to  split   data  on  ingest.   • For  example:   • Sending  data  to  mul8ple   clusters  for  DR.   • To  mul8ple  des8na8ons.   • Flume  also  supports   par88oning,  which  is  key  to   our  implementa8on.  
  • 90. 90  ©  Cloudera,  Inc.  All  rights  reserved.   Flume  Agent           Web  Logs   Spooling  Dir   Source   Timestamp   Interceptor    File   Channel   Avro  Sink   Avro  Sink   Avro  Sink   Flume  Architecture  –  Client  Tier   Web  Server   Flume  Agent  
  • 91. 91  ©  Cloudera,  Inc.  All  rights  reserved.   Flume  Architecture  –  Collector  Tier   Flume  Agent   Flume  Agent           HDFS   Avro   Source   File   Channel   HDFS  Sink   HDFS  Sink  
  • 92. 92  ©  Cloudera,  Inc.  All  rights  reserved.   •  Add  Kaaa  producer  to  our  webapp   •  Send  clicks  and  searches  as  messages   •  Flume  can  ingest  events  from  Kaaa   •  We  can  add  a  second  consumer  for  real-­‐8me   processing  in  SparkStreaming   •  Another  consumer  for  aler8ng…   •  And  maybe  a  batch  consumer  too     What  if….  We  were  to  use  Kaaa?  
  • 93. 93  ©  Cloudera,  Inc.  All  rights  reserved.   Channels   Sinks   Flume  Agent   The  Kaaa  Channel   Kafka HDFS,  HBase,  Solr   Producer   A   Producer   B   Producer   C   Kaaa     Producers  
  • 94. 94  ©  Cloudera,  Inc.  All  rights  reserved.   Sources   Interceptors   Channels   Flume  Agent   The  Kaaa  Channel   Twi^er,  logs,  JMS,   webserver   Mask,  re-­‐format,   validate…   Kafka Consumer  A   Kaaa   Consumers   Consumer  B   Consumer  C  
  • 95. 95  ©  Cloudera,  Inc.  All  rights  reserved.   Sources   Interceptors   Selectors   Channels   Sinks   Flume  Agent   The  Kaaa  Channel   Twi^er,  logs,  JMS,   webserver   Mask,  re-­‐format,   validate…   DR,  cri8cal   Kafka HDFS,  HBase,  Solr  
  • 96. 96  ©  Cloudera,  Inc.  All  rights  reserved.   Architectural  Considera8ons   Data  Processing  –  Engines   tiny.cloudera.com/app-­‐arch-­‐slides  
  • 97. 97  ©  Cloudera,  Inc.  All  rights  reserved.   Processing  Engines   •  MapReduce   •  Abstrac8ons     •  Spark   •  Spark  Streaming   •  Impala  
  • 98. 98  ©  Cloudera,  Inc.  All  rights  reserved.   Data  Processing  Engines   MapReduce  and  Abstrac8ons  
  • 99. 99  ©  Cloudera,  Inc.  All  rights  reserved.   MapReduce     •  Oldie  but  goody     •  Restric8ve  Framework  /  Innovated  Work  Around   •  Extreme  Batch    
  • 100. 100  ©  Cloudera,  Inc.  All  rights  reserved.   MapReduce  Basic  High  Level     Mapper   HDFS     (Replicated)   Na8ve  File  System   Block  of  Data   Temp  Spill   Data   Par88oned   Sorted  Data   Reducer   Reducer  Local   Copy   Output  File  
  • 101. 101  ©  Cloudera,  Inc.  All  rights  reserved.   MapReduce  Innova8on     •  Mapper  Memory  Joins   •  Reducer  Memory  Joins   •  Buckets  Sorted  Joins   •  Cross  Task  Communica8on   •  Windowing   •  And  Much  More  
  • 102. 102  ©  Cloudera,  Inc.  All  rights  reserved.   Abstrac8ons     •  SQL   • Hive   •  Script/Code   • Pig:  Pig  La8n       • Crunch:  Java/Scala   • Cascading:  Java/Scala  
  • 103. 103  ©  Cloudera,  Inc.  All  rights  reserved.   Data  Processing  Engines   Spark  and  Spark  Streaming  
  • 104. 104  ©  Cloudera,  Inc.  All  rights  reserved.   Spark   •  The  New  Kid  that  isn’t  that  New  Anymore   •  Easily  10x  less  code   •  Extremely  Easy  and  Powerful  API   •  Very  good  for  machine  learning   •  Scala,  Java,  and  Python   •  RDDs   •  DAG  Engine  
  • 105. 105  ©  Cloudera,  Inc.  All  rights  reserved.   Spark  -­‐  DAG  
  • 106. 106  ©  Cloudera,  Inc.  All  rights  reserved.   Spark  -­‐  DAG   Filter   KeyBy   KeyBy   TextFile   TextFile   Join   Filter   Take  
  • 107. 107  ©  Cloudera,  Inc.  All  rights  reserved.   Spark  -­‐  DAG   Filter   KeyBy   KeyBy   TextFile   TextFile   Join   Filter   Take   Good   Good   Good   Good   Good   Good   Good-­‐Replay   Good-­‐Replay   Good-­‐Replay   Good   Good-­‐Replay   Good   Good-­‐Replay     Lost  Block   Replay   Good-­‐Replay     Lost  Block   Good   Future   Future   Future   Future  
  • 108. 108  ©  Cloudera,  Inc.  All  rights  reserved.   Spark  Streaming   •  Calling  Spark  in  a  Loop   •  Extends  RDDs  with  DStream   •  Very  Li^le  Code  Changes  from  ETL  to  Streaming  
  • 109. 109  ©  Cloudera,  Inc.  All  rights  reserved.   Spark  Streaming   Single  Pass   Source   Receiver   RDD   Source   Receiver   RDD   RDD   Filter   Count   Print   Source   Receiver   RDD   RDD   RDD   Single  Pass   Filter   Count   Print   Pre-­‐first     Batch   First     Batch   Second     Batch  
  • 110. 110  ©  Cloudera,  Inc.  All  rights  reserved.   Spark  Streaming   Single  Pass   Source   Receiver   RDD   Source   Receiver   RDD   RDD   Filter   Count   Print   Source   Receiver   RDD   RDD   RDD   Single  Pass   Filter   Count   Pre-­‐first     Batch   First     Batch   Second     Batch   Stateful  RDD  1   Print   Stateful  RDD  2   Stateful  RDD  1  
  • 111. 111  ©  Cloudera,  Inc.  All  rights  reserved.   Data  Processing  Engines   Impala  
  • 112. 112  ©  Cloudera,  Inc.  All  rights  reserved.   Impala   •  MPP  Style  SQL  Engine  on  top  of  Hadoop   •  Very  Fast   •  High  Concurrency   •  Analy8cal  windowing  func8ons  (C5.2).  
  • 113. 113  ©  Cloudera,  Inc.  All  rights  reserved.   Impala  –  Broadcast  Join   Impala  Daemon   Smaller  Table   Data  Block   100%  Cached   Smaller  Table   Smaller  Table   Data  Block   Impala  Daemon   100%  Cached   Smaller  Table   Impala  Daemon   100%  Cached   Smaller  Table   Impala  Daemon   Hash  Join  Func8on   Bigger  Table  Data   Block   100%  Cached   Smaller  Table   Output   Impala  Daemon   Hash  Join  Func8on   Bigger  Table  Data   Block   100%  Cached   Smaller  Table   Output   Impala  Daemon   Hash  Join  Func8on   Bigger  Table  Data   Block   100%  Cached   Smaller  Table   Output  
  • 114. 114  ©  Cloudera,  Inc.  All  rights  reserved.   Impala  –  Par88oned  Hash  Join   Impala  Daemon   Smaller  Table   Data  Block   ~33%  Cached   Smaller  Table   Smaller  Table   Data  Block   Impala  Daemon   ~33%  Cached   Smaller  Table   Impala  Daemon   ~33%  Cached   Smaller  Table   Hash  Par88oner   Hash  Par88oner   Impala  Daemon   BiggerTable  Data   Block   Impala  Daemon   Impala  Daemon   Hash  Par88oner   Hash  Join  Func8on   33%  Cached   Smaller  Table   Hash  Join  Func8on   33%  Cached   Smaller  Table   Hash  Join  Func8on   33%  Cached   Smaller  Table   Output   Output   Output   BiggerTable  Data   Block   Hash  Par88oner   BiggerTable  Data   Block   Hash  Par88oner  
  • 115. 115  ©  Cloudera,  Inc.  All  rights  reserved.   Impala  vs  Hive   •  Very  different  approaches  and     •  We  may  see  convergence  at  some  point   •  But  for  now   • Impala  for  speed   • Hive  for  batch  
  • 116. 116  ©  Cloudera,  Inc.  All  rights  reserved.   Architectural  Considera8ons   Data  Processing  –  Pa^erns  and  Recommenda8ons  
  • 117. 117  ©  Cloudera,  Inc.  All  rights  reserved.   What  processing  needs  to  happen?   •  Sessioniza8on   •  Filtering   •  Deduplica8on   •  BI  /  Discovery  
  • 118. 118  ©  Cloudera,  Inc.  All  rights  reserved.   Sessioniza8on   Website  visit   Visitor  1   Session  1   Visitor  1   Session  2   Visitor  2   Session  1   >  30  minutes  
  • 119. 119  ©  Cloudera,  Inc.  All  rights  reserved.   Why  sessionize?   Helps  answers  ques8ons  like:   •  What  is  my  website’s  bounce  rate?   • i.e.  how  many  %  of  visitors  don’t  go  past  the  landing  page?   •  Which  marke8ng  channels  (e.g.  organic  search,  display  ad,  etc.)  are  leading  to   most  sessions?   • Which  ones  of  those  lead  to  most  conversions  (e.g.  people  buying  things,   signing  up,  etc.)   •  Do  a^ribu8on  analysis  –  which  channels  are  responsible  for  most  conversions?  
  • 120. 120  ©  Cloudera,  Inc.  All  rights  reserved.   Sessioniza8on   244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12+1413580110 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/ 1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” 244.157.45.12+1413583199
  • 121. 121  ©  Cloudera,  Inc.  All  rights  reserved.   How  to  Sessionize?   1.  Given  a  list  of  clicks,  determine  which  clicks  came  from   the  same  user  (Par88oning,  ordering)   2.  Given  a  par8cular  user's  clicks,  determine  if  a  given  click   is  a  part  of  a  new  session  or  a  con8nua8on  of  the   previous  session  (Iden8fying  session  boundaries)  
  • 122. 122  ©  Cloudera,  Inc.  All  rights  reserved.   #1  –  Which  clicks  are  from  same  user?   •  We  can  use:   • IP  address  (244.157.45.12)   • Cookies  (A9A3BECE0563982D)   • IP  address  (244.157.45.12)and  user  agent  string  ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")  
  • 123. 123  ©  Cloudera,  Inc.  All  rights  reserved.   #1  –  Which  clicks  are  from  same  user?   •  We  can  use:   • IP  address  (244.157.45.12)   • Cookies  (A9A3BECE0563982D)   • IP  address  (244.157.45.12)and  user  agent  string  ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")  
  • 124. 124  ©  Cloudera,  Inc.  All  rights  reserved.   #1  –  Which  clicks  are  from  same  user?   244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1”
  • 125. 125  ©  Cloudera,  Inc.  All  rights  reserved.   #2  –  Which  clicks    part  of  the  same  session?   244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” >  30  mins  apart  =  different  sessions  
  • 126. 126  ©  Cloudera,  Inc.  All  rights  reserved.   Sessioniza8on  engine  recommenda8on   •  We  have  sessioniza8on  code  in  MR  and  Spark  on  github.  The  complexity  of  the   code  varies,  depends  on  the  exper8se  in  the  organiza8on.   •  We  choose  MR   • MR  API  is  stable  and  widely  known   • No  Spark  +  Oozie  (orchestra8on  engine)  integra8on  currently  
  • 127. 127  ©  Cloudera,  Inc.  All  rights  reserved.   Filtering  –  filter  out  incomplete  records   244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…
  • 128. 128  ©  Cloudera,  Inc.  All  rights  reserved.   Filtering  –  filter  out  records  from  bots/spiders   244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 209.85.238.11 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1” Google  spider  IP  address  
  • 129. 129  ©  Cloudera,  Inc.  All  rights  reserved.   Filtering  recommenda8on   •  Bot/Spider  filtering  can  be  done  easily  in  any  of  the  engines   •  Incomplete  records  are  harder  to  filter  in  schema  systems  like  Hive,  Impala,   Pig,  etc.   •  Flume  interceptors  can  also  be  used   •  Pre^y  close  choice  between  MR,  Hive  and  Spark   •  Can  be  done  in  Spark  using  rdd.filter()   •  We  can  simply  embed  this  in  our  MR  sessioniza8on  job  
  • 130. 130  ©  Cloudera,  Inc.  All  rights  reserved.   Deduplica8on  –  remove  duplicate  records   244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36” 244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http:// bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”
  • 131. 131  ©  Cloudera,  Inc.  All  rights  reserved.   Deduplica8on  recommenda8on   •  Can  be  done  in  all  engines.   •  We  already  have  a  Hive  table  with  all  the  columns,  a  simple  DISTINCT  query   will  perform  deduplica8on   •  reduce()  in  spark   •  We  use  Pig  
  • 132. 132  ©  Cloudera,  Inc.  All  rights  reserved.   BI/Discovery  engine  recommenda8on   •  Main  requirements  for  this  are:   • Low  latency   • SQL  interface  (e.g.  JDBC/ODBC)   • Users  don’t  know  how  to  code   •  We  chose  Impala   • It’s  a  SQL  engine   • Much  faster  than  other  engines   • Provides  standard  JDBC/ODBC  interfaces  
  • 133. 133  ©  Cloudera,  Inc.  All  rights  reserved.   End-­‐to-­‐end  processing   Sessioniza8on  Filtering  Deduplica8on   BI  tools  
  • 134. 134  ©  Cloudera,  Inc.  All  rights  reserved.   Architectural  Considera8ons   Orchestra8on  
  • 135. 135  ©  Cloudera,  Inc.  All  rights  reserved.   •  Data  arrives  through  Flume   •  Triggers  a  processing  event:   • Sessionize   • Enrich  –  Loca8on,  marke8ng  channel…   • Store  as  Parquet   •  Each  day  we  process  events  from  the  previous  day   Orchestra8ng  Clickstream  
  • 136. 136  ©  Cloudera,  Inc.  All  rights  reserved.   •  Workflow  is  fairly  simple   •  Need  to  trigger  workflow  based  on  data   •  Be  able  to  recover  from  errors   •  Perhaps  no8fy  on  the  status   •  And  collect  metrics  for  repor8ng   Choosing  Right  
  • 137. 137  ©  Cloudera,  Inc.  All  rights  reserved.   Oozie  or  Azkaban?  
  • 138. 138  ©  Cloudera,  Inc.  All  rights  reserved.   Oozie  Architecture  
  • 139. 139  ©  Cloudera,  Inc.  All  rights  reserved.   •  Part  of  all  major  Hadoop  distribu8ons   •  Hue  integra8on   •  Built  -­‐in  ac8ons  –  Hive,  Sqoop,  MapReduce,  SSH   •  Complex  workflows  with  decisions   •  Event  and  8me  based  scheduling   •  No8fica8ons   •  SLA  Monitoring   •  REST  API   Oozie  features  
  • 140. 140  ©  Cloudera,  Inc.  All  rights  reserved.   •  Overhead  in  launching  jobs   •  Steep  learning  curve   •  XML  Workflows   Oozie  Drawbacks  
  • 141. 141  ©  Cloudera,  Inc.  All  rights  reserved.   Azkaban  Architecture       Azkaban  Executor  Server       Azkaban  Web  Server   HDFS  viewer   plugin   Job  types  plugin   MySQL   Client   Hadoop  
  • 142. 142  ©  Cloudera,  Inc.  All  rights  reserved.   •  Simplicity   •  Great  UI  –  including  pluggable  visualizers   •  Lots  of  plugins  –  Hive,  Pig…   •  Repor8ng  plugin   Azkaban  features  
  • 143. 143  ©  Cloudera,  Inc.  All  rights  reserved.   •  Doesn’t  support  workflow  decisions   •  Can’t  represent  data  dependency   Azkaban  Limita8ons  
  • 144. 144  ©  Cloudera,  Inc.  All  rights  reserved.   •  Workflow  is  fairly  simple   •  Need  to  trigger  workflow  based  on  data   •  Be  able  to  recover  from  errors   •  Perhaps  no8fy  on  the  status   •  And  collect  metrics  for  repor8ng   Choosing…   Easier  in  Oozie  
  • 145. 145  ©  Cloudera,  Inc.  All  rights  reserved.   •  Workflow  is  fairly  simple   •  Need  to  trigger  workflow  based  on  data   •  Be  able  to  recover  from  errors   •  Perhaps  no8fy  on  the  status   •  And  collect  metrics  for  repor8ng   Choosing  the  right  Orchestra8on  Tool   Be^er  in  Azkaban  
  • 146. 146  ©  Cloudera,  Inc.  All  rights  reserved.   The  best  orchestra8on  tool   is  the  one  you  are  an  expert  on   Important  Decision  Considera8on!  
  • 147. 147  ©  Cloudera,  Inc.  All  rights  reserved.   Orchestra8on  Pa^erns  –  Fan  Out  
  • 148. 148  ©  Cloudera,  Inc.  All  rights  reserved.   Capture  &  Decide  Pa^ern  
  • 149. 149  ©  Cloudera,  Inc.  All  rights  reserved.   Puwng  It  All  Together   Final  Architecture  
  • 150. 150  ©  Cloudera,  Inc.  All  rights  reserved.   Final  Architecture  –  High  Level  Overview   Data   Sources   Inges8on   Raw  Data   Storage   (Formats,   Schema)   Processed   Data  Storage   (Formats,   Schema)   Processing   Data   Consump8on   Orchestra8on   (Scheduling,   Managing,   Monitoring)  
  • 151. 151  ©  Cloudera,  Inc.  All  rights  reserved.   Final  Architecture  –  High  Level  Overview   Data   Sources   Inges8on   Raw  Data   Storage   (Formats,   Schema)   Processed   Data  Storage   (Formats,   Schema)   Processing   Data   Consump8on   Orchestra8on   (Scheduling,   Managing,   Monitoring)  
  • 152. 152  ©  Cloudera,  Inc.  All  rights  reserved.   Final  Architecture  –  Inges8on/Storage   Web  Server   Flume  Agent   Web  Server   Flume  Agent   Web  Server   Flume  Agent   Web  Server   Flume  Agent   Web  Server   Flume  Agent   Web  Server   Flume  Agent   Web  Server   Flume  Agent   Web  Server   Flume  Agent   Flume  Agent   Flume  Agent   Flume  Agent   Flume  Agent   Fan-­‐in     Pa^ern   Mul8  Agents  for     Failover  and  rolling  restarts   HDFS     /etl/BI/casualcyclist/ clicks/rawlogs/ year=2014/month=10/ day=10!
  • 153. 153  ©  Cloudera,  Inc.  All  rights  reserved.   Final  Architecture  –  High  Level  Overview   Data   Sources   Inges8on   Raw  Data   Storage   (Formats,   Schema)   Processed   Data  Storage   (Formats,   Schema)   Processing   Data   Consump8on   Orchestra8on   (Scheduling,   Managing,   Monitoring)  
  • 154. 154  ©  Cloudera,  Inc.  All  rights  reserved.   Final  Architecture  –  Processing  and  Storage   /etl/BI/ casualcyclist/clicks/ rawlogs/year=2014/ month=10/day=10   …   dedup-­‐>filtering-­‐ >sessioniza8on   /data/bikeshop/ clickstream/ year=2014/ month=10/day=10   …   parque8ze  
  • 155. 155  ©  Cloudera,  Inc.  All  rights  reserved.   Final  Architecture  –  High  Level  Overview   Data   Sources   Inges8on   Raw  Data   Storage   (Formats,   Schema)   Processed   Data  Storage   (Formats,   Schema)   Processing   Data   Consump8on   Orchestra8on   (Scheduling,   Managing,   Monitoring)  
  • 156. 156  ©  Cloudera,  Inc.  All  rights  reserved.   Final  Architecture  –  Data  Access   Hive/ Impala   BI/Analy8cs   Tools   DWH   Sqoop   Local  Disk   R,  etc.   DB  import  tool   JDBC/ODBC  
  • 157. 157  ©  Cloudera,  Inc.  All  rights  reserved.   Demo  
  • 158. 158  ©  Cloudera,  Inc.  All  rights  reserved.   Join  the  Discussion       Get  community   help  or  provide   feedback     cloudera.com/community  
  • 159. 159  ©  Cloudera,  Inc.  All  rights  reserved.  Try  Cloudera  Live  today—cloudera.com/live  
  • 160. 160  ©  Cloudera,  Inc.  All  rights  reserved.   Thank  you   hadooparchitecturebook.com   8ny.cloudera.com/app-­‐arch-­‐slides