Application architectures with Hadoop and Sessionization in MR

A presentation on architecting applications with Hadoop, with an example of sessionization code used for clickstream analysis in MapReduce.

Transcript

  • 1. Application Architectures with Apache Hadoop. Mark Grover | @mark_grover. East Bay JUG, June 25th, 2014. hadooparchitecturebook.com. ©2014 Cloudera, Inc. All Rights Reserved.
  • 2. About Me
      • Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating).
      • Contributor to Hadoop, Hive, Spark, Sqoop, Flume.
      • Software developer at Cloudera
      • @mark_grover
  • 3. Co-authoring O'Reilly book
      • @hadooparchbook
      • hadooparchitecturebook.com
  • 4. What is Apache Hadoop? Apache Hadoop is an open source platform for data storage and processing that is scalable, fault tolerant, and distributed.
      • Core Hadoop system components
          • Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
          • MapReduce: distributed computing framework
      • Has the flexibility to store and mine any type of data
          • Ask questions across structured and unstructured data that were previously impossible to ask or solve
          • Not bound by a single schema
      • Excels at processing complex data
          • Scale-out architecture divides workloads across multiple nodes
          • Flexible file system eliminates ETL bottlenecks
      • Scales economically
          • Can be deployed on commodity hardware
          • Open source platform guards against vendor lock-in
  • 5. Click Stream Analysis: Case Study
  • 6. Analytics
  • 7. Web Logs
      244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
      244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
  • 8. Clickstream Analytics
      244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
  • 9. Click Stream Analysis (Before Hadoop). [Diagram labels: Web logs, ODS, Data Warehouse, Business Intelligence; steps: Extract, Transform, Load, Transform, Query]
  • 10. The Problems (Before Hadoop). [Same diagram, annotated with two numbered problems]
      1. Slow data transformations = missed ETL SLAs.
      2. Slow queries = frustrated business users.
  • 11. Click Stream Analysis (After Hadoop). [Same diagram, with two items crossed out (X X)]
  • 12. Click Stream Analysis (After Hadoop). [Diagram labels: Web logs, ODS, Business Intelligence; steps: Transform, Query]
  • 13. Click Stream Analysis (After Hadoop). [Same diagram as slide 12, annotated with open questions: Flume or Sqoop? Flume or Sqoop? Hive/Impala/MR? Hive/Impala/Spark?]
  • 14. Challenges of Hadoop Implementation
  • 15. Challenges of Hadoop Implementation
  • 16. Other challenges: Architectural Considerations
      • Storage managers?
          • HDFS? HBase?
      • Data storage and modeling
          • File formats? Compression? Schema design?
      • Data movement
          • How do we actually get the data into Hadoop? How do we get it out?
      • Metadata
          • How do we manage data about the data?
      • Data access and processing
          • How will the data be accessed once in Hadoop? How can we transform it? How do we query it?
      • Orchestration
          • How do we manage the workflow for all of this?
  • 17. High level Design
  • 18. Clickstream Analytics
      244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
  • 19. [Architecture diagram centered on a Hadoop cluster]
      • Numbered stages: 1. Ingestion; 2. Processing; 3. Accessing; 4. Orchestration via Oozie
      • Source-side labels: Website users; Web servers; Operational Data Store; CRM System; via Sqoop
      • Consumer-side labels: BI/visualization tool (e.g. MicroStrategy); BI analysts; Spark, for machine learning and graph processing; R/Python, for statistical analysis; Custom apps
  • 20. 2. Processing (since that's of most interest to this audience)
  • 21. Processing
      • De-duplication
      • Filtering
      • Sessionization
  • 22. Deduplication: remove duplicate records
      244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
      244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
  • 23. Filtering: filter out invalid records
      244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
      244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…
  • 24. Sessionization. [Diagram labels: Website visit; Visitor 1 Session 1; Visitor 1 Session 2; Visitor 2 Session 1; > 30 minutes]
  • 25. Why sessionize? It helps answer questions like:
      • What is my website's bounce rate?
          • i.e. what percentage of visitors don't go past the landing page?
      • Which marketing channels (e.g. organic search, display ad, etc.) are leading to the most sessions?
          • Which of those lead to the most conversions (e.g. people buying things, signing up, etc.)?
      • Attribution analysis: which channels are responsible for the most conversions?
  • 26. Sessionization (each log line with a session ID appended)
      244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" 165
      244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" 166
  • 27. How to Sessionize?
      1. Given a list of clicks, determine which clicks came from the same user
      2. Given a particular user's clicks, determine if a given click is part of a new session or a continuation of the previous session
  • 28. #1: Which clicks are from the same user?
      • We can use:
          • IP address (244.157.45.12)
          • Cookies (A9A3BECE0563982D)
          • IP address (244.157.45.12) and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36)
  • 30. #1: Which clicks are from the same user?
      244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
      244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
  • 31. #2: Which clicks are part of the same session?
      244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
      244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
      > 30 mins apart = different sessions
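The 30-minute rule from slide 31 can be sketched in plain Java, without any Hadoop classes. This is a minimal illustration only: the class and method names (`SessionRuleSketch`, `assignSessions`) are hypothetical, and the input is assumed to be one user's click times in epoch milliseconds, already sorted in ascending order.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of the 30-minute session rule; names are illustrative. */
final class SessionRuleSketch {
    // 30 minutes in milliseconds, matching the gap shown on the slide.
    static final long SESSION_TIMEOUT_IN_MS = 30L * 60L * 1000L;

    /**
     * Takes one user's click times (epoch millis, ascending) and returns a
     * session number for each click, starting at 1.
     */
    static List<Integer> assignSessions(List<Long> sortedClickTimes) {
        List<Integer> sessions = new ArrayList<>();
        int sessionId = 0;
        Long last = null;
        for (long t : sortedClickTimes) {
            // First click, or a gap longer than the timeout: start a new session.
            if (last == null || t - last > SESSION_TIMEOUT_IN_MS) {
                sessionId++;
            }
            last = t;
            sessions.add(sessionId);
        }
        return sessions;
    }
}
```

The two sample clicks at 21:08:30 and 21:59:59 are 51 minutes 29 seconds apart, so under this rule they fall into two different sessions.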
  • 32. Intro to MapReduce
  • 33. MapReduce
      • Map: apply a function to each input record
      • Shuffle & Sort: partition the map output and sort each partition
      • Reduce: apply an aggregation function to all values grouped under each key in each partition
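To make the three phases concrete, here is a small in-memory simulation in plain Java that counts clicks per IP address. It is a sketch under simplifying assumptions, not Hadoop code: there is a single partition, the class name `MiniMapReduce` is hypothetical, and the IP is assumed to be the first space-separated token of each log line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory illustration of map, shuffle and sort, and reduce. */
final class MiniMapReduce {
    /** Map: emit (ip, 1) for one log line. */
    static Map.Entry<String, Integer> map(String logLine) {
        return Map.entry(logLine.split(" ", 2)[0], 1);
    }

    /** Shuffle and sort: group map output by key; TreeMap keeps keys sorted. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    /** Reduce: aggregate all values for one key, here by summing them. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs the three phases in order and returns clicks per IP. */
    static Map<String, Integer> run(List<String> logLines) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String line : logLines) {
            mapOutput.add(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapOutput).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }
}
```

In a real cluster the map output would be split across several partitions, one per reducer, and each partition would be sorted independently.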
  • 34. Sessionization in MapReduce
  • 35. Sessionization in MapReduce: github.com/hadooparchitecturebook/hadoop-arch-book/tree/master/ch06/MRSessionize
  • 36. Sessionization in MapReduce. [Data-flow diagram: three Map tasks each take a log line as input; two Reduce tasks receive "IP1, log lines" and "IP2, log lines" respectively, and each outputs "log line, session ID"]
  • 37. Mapper for Sessionization
        public static class SessionizeMapper
                extends Mapper<Object, Text, IpTimestampKey, Text> {
            private Matcher logRecordMatcher;

            // logRecordPattern and TIMESTAMP_FORMATTER are referenced below but
            // defined outside this class (not shown on the slide).
            public void map(Object key, Text value, Context context
            ) throws IOException, InterruptedException {
                logRecordMatcher = logRecordPattern.matcher(value.toString());

                // We only emit something if the record matches our regex. Otherwise,
                // we assume the record is malformed and simply ignore it.
                if (logRecordMatcher.matches()) {
                    String ip = logRecordMatcher.group(1);
                    DateTime timestamp = DateTime.parse(logRecordMatcher.group(2),
                            TIMESTAMP_FORMATTER);
                    // Despite the name, this value is in epoch milliseconds.
                    Long unixTimestamp = timestamp.getMillis();
                    IpTimestampKey outputKey = new IpTimestampKey(ip, unixTimestamp);
                    context.write(outputKey, value);
                }
            }
        }
  • 38. Reducer for Sessionization
        public static class SessionizeReducer
                extends Reducer<IpTimestampKey, Text, IpTimestampKey, Text> {
            private Text result = new Text();
            // Shared counter: session IDs are unique only within one reducer JVM.
            private static int sessionId = 0;

            public void reduce(IpTimestampKey key, Iterable<Text> values, Context context
            ) throws IOException, InterruptedException {
                // Reset on every reduce() call (one call per IP), so that the first
                // click of each user always starts a new session.
                Long lastTimeStamp = null;
                for (Text value : values) {
                    String logRecord = value.toString();
                    // If this is the first record for this user or it's been more than the
                    // timeout since the last click from this user, increment the session ID.
                    // The framework updates key as we iterate, so its timestamp is current.
                    if (lastTimeStamp == null
                            || (key.getUnixTimestamp() - lastTimeStamp > SESSION_TIMEOUT_IN_MS)) {
                        sessionId++;
                    }
                    lastTimeStamp = key.getUnixTimestamp();
                    result.set(logRecord + " " + sessionId);
                    // Since we only care about printing out the entire record in the result,
                    // with the session ID appended at the end, we emit "null" for the key.
                    context.write(null, result);
                }
            }
        }
  • 39. Secondary sorting: by timestamp
      • Records need to reach the reducer grouped by IP address and sorted by timestamp, a concept called secondary sorting
      • Instead of using just the IP address as the map output key and reduce input key
          • We use a composite key (IP, timestamp) as the map output key and reduce input key
  • 40. Secondary sorting: vocabulary
      • Composite key: IP address, timestamp
      • Natural key: IP address
      • Secondary sort key: timestamp
  • 41. Secondary sorting
      • Custom Grouping Comparator: on Natural Key (IP)
      • Custom Sort Comparator: on Composite Key (IP, timestamp)
      • Custom Partitioner: on Natural Key (IP)
        job.setGroupingComparatorClass(NaturalKeyComparator.class);
        job.setSortComparatorClass(CompositeKeyComparator.class);
        job.setPartitionerClass(NaturalKeyPartitioner.class);
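The roles of the three custom classes on slide 41 can be modeled in plain Java, without Hadoop's `WritableComparator` or `Partitioner` base classes. This is a sketch only: `SecondarySortSketch`, `IpTimestamp`, and `timestampsFor` are hypothetical names, and real implementations would be written against the Hadoop types registered on the job.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Plain-Java model of the sort comparator, grouping comparator, and partitioner. */
final class SecondarySortSketch {
    /** Composite key: natural key (ip) plus secondary sort key (timestamp). */
    static final class IpTimestamp {
        final String ip;
        final long timestamp;

        IpTimestamp(String ip, long timestamp) {
            this.ip = ip;
            this.timestamp = timestamp;
        }
    }

    /** Sort comparator: orders by IP first, then by timestamp. */
    static final Comparator<IpTimestamp> COMPOSITE_KEY_ORDER =
            Comparator.comparing((IpTimestamp k) -> k.ip).thenComparingLong(k -> k.timestamp);

    /** Grouping comparator: keys with the same IP compare as equal. */
    static final Comparator<IpTimestamp> NATURAL_KEY_ORDER =
            Comparator.comparing((IpTimestamp k) -> k.ip);

    /** Partitioner: uses the IP only, so all clicks of one IP reach the same reducer. */
    static int partition(IpTimestamp key, int numPartitions) {
        // The mask clears the sign bit so the result is never negative.
        return (key.ip.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Sorts keys by composite order, then lists one IP's timestamps in that order. */
    static List<Long> timestampsFor(String ip, List<IpTimestamp> keys) {
        List<IpTimestamp> sorted = new ArrayList<>(keys);
        sorted.sort(COMPOSITE_KEY_ORDER);
        IpTimestamp probe = new IpTimestamp(ip, 0L);
        List<Long> out = new ArrayList<>();
        for (IpTimestamp k : sorted) {
            if (NATURAL_KEY_ORDER.compare(k, probe) == 0) {
                out.add(k.timestamp);
            }
        }
        return out;
    }
}
```

The design point is that partitioning and grouping ignore the timestamp while sorting includes it, which is what lets a single reduce call see one IP's clicks in time order.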
  • 42. Data Storage and Modeling
  • 43. Data Storage: Storage Manager considerations
      • Popular storage managers for Hadoop
          • Hadoop Distributed File System (HDFS)
          • HBase
  • 44. Data Storage: HDFS vs HBase
      • HDFS
          • Stores data directly as files
          • Fast scans
          • Poor random reads/writes
      • HBase
          • Stores data as HFiles on HDFS
          • Slow scans
          • Fast random reads/writes
  • 45. Data Storage: Storage Manager considerations
      • We choose HDFS
          • Analytical needs in this case are served better by fast scans.
  • 46. Data Storage Format
  • 47. Data Storage: Format Considerations
      • Store as plain text?
          • Sure, well supported by Hadoop.
          • Text can easily be processed by MapReduce, loaded into Hive for analysis, and so on.
      • But…
          • Will begin to consume lots of space in HDFS.
          • May not be optimal for processing by tools in the Hadoop ecosystem.
  • 48. Data Storage: Format Considerations
      • But, we can compress the text files…
          • Gzip: supported by Hadoop, but not splittable.
          • Bzip2: hey, splittable! Great compression! But decompression is slooowww.
          • LZO: splittable (with some work), good compress/decompress performance. Good choice for storing text files on Hadoop.
          • Snappy: provides a good tradeoff between size and speed.
  • 49. Data Storage: More About Snappy
      • Designed at Google to provide high compression speeds with reasonable compression.
      • Not the highest compression, but provides very good performance for processing on Hadoop.
      • Snappy is not splittable though, which brings us to…
  • 50. SequenceFile
      • Stores records as binary key/value pairs.
      • SequenceFile "blocks" can be compressed.
      • This enables splittability with non-splittable compression.
  • 51. Avro
      • Kinda SequenceFile on steroids.
      • Self-documenting: stores schema in header.
      • Provides very efficient storage.
      • Supports splittable compression.
  • 52. Our Format Choices…
      • Avro with Snappy
          • Snappy provides optimized compression.
          • Avro provides compact storage, self-documenting files, and supports schema evolution.
          • Avro also provides better failure handling than other choices.
      • SequenceFiles would also be a good choice, and are directly supported by ingestion tools in the ecosystem.
          • But they only support Java.
  • 53. HDFS Schema Design
  • 54. Recommended HDFS Schema Design
      • How to lay out data on HDFS?
  • 55. Recommended HDFS Schema Design
      • /user/<username>: user-specific data, jars, conf files
      • /etl: data in various stages of ETL workflow
      • /tmp: temp data from tools or shared between users
      • /data: shared data for the entire organization
      • /app: everything but data (UDF jars, HQL files, Oozie workflows)
  • 56. Advanced HDFS Schema Design
  • 57. What is Partitioning?
      • Un-partitioned HDFS directory structure:
            dataset
              file1.txt
              file2.txt
              ...
              filen.txt
      • Partitioned HDFS directory structure:
            dataset
              col=val1/file.txt
              col=val2/file.txt
              ...
              col=valn/file.txt
  • 58. What is Partitioning?
      • Un-partitioned HDFS directory structure:
            clicks
              clicks-2014-01-01.txt
              clicks-2014-01-02.txt
              ...
              clicks-2014-03-31.txt
      • Partitioned HDFS directory structure:
            clicks
              dt=2014-01-01/clicks.txt
              dt=2014-01-02/clicks.txt
              ...
              dt=2014-03-31/clicks.txt
  • 59. Partitioning
      • Split the dataset into smaller consumable chunks
      • Rudimentary form of "indexing"
      • <data set name>/<partition_column_name=partition_column_value>/{files}
  • 60. Partitioning considerations
      • What column to partition by?
      • HDFS is append only.
      • Don't have too many partitions (<10,000)
      • Don't have too many small files in the partitions (files should generally be larger than the block size)
      • We decided to partition by timestamp
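As an illustration of the `<data set name>/<partition_column_name=partition_column_value>/` layout with a date partition, the following plain-Java sketch derives the partition directory for a click from its timestamp. The class and method names are hypothetical, the `dt` column name follows the example on slide 58, and using UTC for the day boundary is an assumption, not something the slides specify.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Sketch: map a click timestamp to its date partition directory. */
final class PartitionPathSketch {
    // Assumption: day boundaries are taken in UTC.
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

    /** Returns e.g. "clicks/dt=2014-01-01" for a timestamp on that day. */
    static String partitionDir(String dataset, long epochMillis) {
        return dataset + "/dt=" + DAY.format(Instant.ofEpochMilli(epochMillis));
    }
}
```

A query restricted to one day then only needs to read the files under that one directory, which is why the slides call partitioning a rudimentary form of indexing.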
  • 61. What is bucketing?
      • Un-bucketed HDFS directory structure:
            clicks
              dt=2014-01-01/clicks.txt
              dt=2014-01-02/clicks.txt
      • Bucketed HDFS directory structure:
            clicks
              dt=2014-01-01/file0.txt
              dt=2014-01-01/file1.txt
              dt=2014-01-01/file2.txt
              dt=2014-01-01/file3.txt
              dt=2014-01-02/file0.txt
              dt=2014-01-02/file1.txt
              dt=2014-01-02/file2.txt
              dt=2014-01-02/file3.txt
  • 62. Bucketing
      • Hash-bucketed files within each partition, based on a particular column
      • Useful when sampling
      • Useful in some joins; prerequisites:
          • Datasets are bucketed on the same key as the join key
          • The number of buckets is the same, or one is a multiple of the other
  • 63. Bucketing considerations?
      • Which column to bucket on?
      • How many buckets?
      • We decided to bucket based on cookie
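Hash bucketing on the cookie can be sketched in plain Java as follows. The names are hypothetical, and `String.hashCode()` is used only for illustration; tools such as Hive apply their own hash function, so the actual bucket numbers they produce will differ.

```java
/** Sketch: assign a record to a bucket file by hashing its cookie. */
final class BucketSketch {
    /** Returns a bucket number in the range [0, numBuckets). */
    static int bucketFor(String cookie, int numBuckets) {
        // The mask clears the sign bit so the result is never negative.
        return (cookie.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    /** Returns a file path following the fileN.txt naming on slide 61. */
    static String bucketFile(String partitionDir, String cookie, int numBuckets) {
        return partitionDir + "/file" + bucketFor(cookie, numBuckets) + ".txt";
    }
}
```

This also shows why the join prerequisite on slide 62 allows one bucket count to be a multiple of the other: with the same hash, a record's bucket in an 8-bucket layout, taken modulo 4, equals its bucket in a 4-bucket layout, so each bucket on one side only has to be matched against a known subset of buckets on the other.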
  • 64. De-normalizing considerations
      • In general, big data joins are expensive
      • When to de-normalize?
      • Decided to join the smaller dimension tables
      • Big fact tables are still joined
  • 65. Data Ingestion
  • 66. File Transfers
      • "hadoop fs -put <file>"
      • Reliable, but not resilient to failure.
      • Other options are mountable HDFS, for example NFSv3.
  • 67. Streaming Ingestion
      • Flume
          • Reliable, distributed, and available system for efficient collection, aggregation and movement of streaming data, e.g. logs.
      • Kafka
          • Reliable and distributed publish-subscribe messaging system.
  • 68. Flume vs. Kafka
      • Flume
          • Purpose built for Hadoop data ingest.
          • Pre-built sinks for HDFS, HBase, etc.
          • Supports transformation of data in-flight.
      • Kafka
          • General pub-sub messaging framework.
          • Hadoop not supported, requires 3rd-party component (Camus).
          • Just a message transport (a very fast one).
  • 69. Flume vs. Kafka
      • Bottom line:
          • Flume is very well integrated with the Hadoop ecosystem, and well suited to ingestion of sources such as log files.
          • Kafka is a highly reliable and scalable enterprise messaging system, and great for scaling out to multiple consumers.
  • 70. A Quick Introduction to Flume. [Diagram labels: External Source; Flume Agent containing Source, Channel, Sink; Destination]
      • External sources: web server, Twitter, JMS, system logs, …
      • Flume agent: JVM process hosting components
      • Source: consumes events and forwards to channels
      • Channel: stores events until consumed by sinks (file, memory, JDBC)
      • Sink: removes event from channel and puts into external destination
  • 71. A Quick Introduction to Flume. [Diagram labels: Flume Agent containing Source, Channel, Sink; Destination]
      • Reliable: events are stored in channel until delivered to next stage.
      • Recoverable: events can be persisted to disk and recovered in the event of failure.
  • 72. A Quick Introduction to Flume
      • Declarative
          • No coding required.
          • Configuration specifies how components are wired together.
  • 73. A Brief Discussion of Flume Patterns: Fan-in
      • Flume agent runs on each of our servers.
      • These agents send data to multiple agents to provide reliability.
      • Flume provides support for load balancing.
  • 74. A Brief Discussion of Flume Patterns: Splitting
      • Common need is to split data on ingest.
      • For example:
          • Sending data to multiple clusters for DR.
          • To multiple destinations.
      • Flume also supports partitioning, which is key to our implementation.
  • 75. Sqoop Overview
      • Apache project designed to ease import and export of data between Hadoop and external data stores such as relational databases.
      • Great for doing bulk imports and exports of data between HDFS, Hive and HBase and an external data store. Not suited for ingesting event-based data.
  • 76. Ingestion Decisions
      • Historical Data
          • Smaller files: file transfer
          • Larger files: Flume with spooling directory source.
      • Incoming Data
          • Flume with the spooling directory source.
  • 77. Data Processing and Access
  • 78. Data flow. [Diagram labels: Raw data; Partitioned clickstream data; Other data (Financial, CRM, etc.); Aggregated dataset #1; Aggregated dataset #2]
  • 79. Data processing tools
      • Hive
      • Impala
      • Pig, etc.
  • 80. Hive
      • Open source data warehouse system for Hadoop
      • Converts SQL-like queries to MapReduce jobs
          • Work is being done to move this away from MR
      • Stores metadata in Hive metastore
      • Can create tables over HDFS or HBase data
      • Access available via JDBC/ODBC
  • 81. Impala
      • Real-time open source SQL query engine for Hadoop
      • Doesn't build on MapReduce
      • Written in C++, uses LLVM for run-time code generation
      • Can create tables over HDFS or HBase data
      • Accesses Hive metastore for metadata
      • Access available via JDBC/ODBC
  • 82. Pig
      • Higher level abstraction over MapReduce (like Hive)
      • Write transformations in scripting language: Pig Latin
      • Can access Hive metastore via HCatalog for metadata
  • 83. Data Processing considerations
      • We chose Hive for ETL and Impala for interactive BI.
  • 84. Metadata Management
  • 85. What is Metadata?
      • Metadata is data about the data
          • Format in which data is stored
          • Compression codec
          • Location of the data
          • Is the data partitioned/bucketed/sorted?
  • 86. Metadata in Hive. [Diagram label: Hive Metastore]
  • 87. Metadata
      • Hive metastore has become the de-facto metadata repository
      • HCatalog makes Hive metastore accessible to other applications (Pig, MapReduce, custom apps, etc.)
  • 88. Hive + HCatalog
  • 89. Orchestration
  • 90. Orchestration
      • Once the data is in Hadoop, we need a way to manage workflows in our architecture.
          • Scheduling and tracking MapReduce jobs, Hive jobs, etc.
      • Several options here:
          • Cron
          • Oozie, Azkaban
          • 3rd-party tools: Talend, Pentaho, Informatica, enterprise schedulers.
  • 91. Oozie
      • Supports defining and executing a sequence of jobs.
      • Can trigger jobs based on external dependencies or schedules.
  • 92. Final Architecture
  • 93. Final Architecture: High Level Overview. [Diagram labels: Data Sources; Ingestion; Data Storage/Processing; Data Reporting/Analysis]
  • 94. Final Architecture: High Level Overview. [Diagram labels: Data Sources; Ingestion; Data Storage/Processing; Data Reporting/Analysis]
  • 95. Final Architecture: Ingestion. [Diagram labels: eight web apps, each with an Avro agent; four Flume agents; fan-in pattern; multiple agents for failover and rolling restarts; HDFS]
  • 96. Final Architecture: High Level Overview. [Diagram labels: Data Sources; Ingestion; Data Storage/Processing; Data Reporting/Analysis]
  • 97. Final Architecture: Storage and Processing. [Diagram labels: /etl/weblogs/20140331/, /etl/weblogs/20140401/, …; Data Processing; /data/marketing/clickstream/bouncerate/, /data/marketing/clickstream/attribution/, …]
  • 98. Final Architecture: High Level Overview. [Diagram labels: Data Sources; Ingestion; Data Storage/Processing; Data Reporting/Analysis]
  • 99. Final Architecture: Data Access. [Diagram labels: Hive/Impala; BI/Analytics Tools; DWH; Sqoop; Local Disk; R, etc.; DB import tool; JDBC/ODBC]
  • 100. Contact info
      • Mark Grover
      • @mark_grover
      • www.linkedin.com/in/grovermark
      • Slides at slideshare.net/markgrover