Search in the Apache Hadoop Ecosystem: Thoughts from the Field

This presentation describes the Hadoop ecosystem and gives examples of how these open source tools are combined and used to solve specific and sometimes very complex problems. Drawing upon case studies from the field, Mr. Moundalexis demonstrates that one-size, rigid traditional systems don’t fit all, but that combinations of tools in the Apache Hadoop ecosystem provide a versatile and flexible platform for integrating, finding, and analyzing information.

Published in Technology, Education

Transcript

  • 1. Search in the Apache Hadoop Ecosystem: Thoughts from the Field
      Open Source Search Conference, November 2013
      Alex Moundalexis, alexm@clouderagovt.com, @technmsg
  • 2. Thoughts of a Former SA
  • 3. Thoughts of a Former SA Field Guy
  • 4. Disclaimer
      • Technologies, not products
      • Cloudera builds things software
          • most donated to Apache
          • some closed-source
      • I will likely mention “Cloudera Something”
      • Cloudera “products” I reference are open source
          • Apache Licensed
          • Source code is on GitHub
              • https://github.com/cloudera
  • 5. What This Talk Isn’t About
      • Deploying
          • Puppet, Chef, Ansible, homegrown scripts, intern labor
      • Sizing & Tuning
          • Depends heavily on data and workload
      • Coding
      • Algorithms
  • 6. “The answer to most Hadoop questions is it depends.”
  • 7. The Apache Hadoop Ecosystem
      Quick and dirty, more time for use cases.
  • 8. Why “Ecosystem?”
      • In the beginning, just Hadoop
          • HDFS
          • MapReduce
      • Today, dozens of interrelated components
          • I/O
          • Processing
          • Specialty Applications
          • Configuration
          • Workflow
  • 9. Partial Ecosystem
      [Diagram labels: API access, external system, BI tool + JDBC/ODBC, web server, SQL, Hadoop, Search, log collection, device logs, user, batch processing, DB table import, DB table export, RDBMS / DWH, machine learning]
  • 10. HDFS
      • Distributed, highly fault-tolerant filesystem
      • Optimized for large streaming access to data
      • Based on Google File System
          • http://research.google.com/archive/gfs.html
  • 11. Lots of Commodity Machines
      Image: Yahoo! Hadoop cluster [OSCON ’07]
  • 12. MapReduce (MR)
      • Programming paradigm
      • Batch oriented, not realtime
      • Works well with distributed computing
      • Lots of Java, but other languages supported
      • Based on Google’s paper
          • http://research.google.com/archive/mapreduce.html
  • 13. Under the Covers
  • 14. You specify map() and reduce() functions. The framework does the rest.
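The division of labor on slide 14 can be sketched in a few lines of plain Python. This is a toy illustration only, not Hadoop's Java API: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, and `run_job` stands in for the framework by running the map step, grouping values by key (the shuffle), and then running the reduce step, all in one process.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_fn(line: str) -> Iterator[tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_job(lines: Iterable[str]) -> dict[str, int]:
    """Toy stand-in for the framework: map, shuffle (group by key), reduce."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            grouped[key].append(value)  # shuffle: collect values per key
    # Reduce each key's values, visiting keys in sorted order.
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(run_job(["the quick brown fox", "the lazy dog", "the end"]))
```

On a real cluster the same two functions would run on many machines at once, and the framework would also handle input splitting, scheduling, and retries after hardware failures.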
  • 15. Apache HBase
      • Random, realtime read/write access
      • Key/value columnar store
      • (b|tr)illions of rows/columns
      • Based on Google BigTable
          • http://research.google.com/archive/bigtable.html
  • 16. Apache Accumulo
      • Random, realtime read/write access
      • Key/value columnar store
      • (b|tr)illions of rows/columns
      • Based on Google BigTable
          • http://research.google.com/archive/bigtable.html
      • Adds cell-level security
      • Implemented by National Security Agency
          • Donated to ASF
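To make "cell-level security" concrete, here is a deliberately simplified Python model. It is not the Accumulo API: `Cell`, `required_labels`, and `scan` are hypothetical names, and the rule used here (a reader sees a cell only if it holds every label attached to that cell) is an assumed simplification of what a real visibility check does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    required_labels: frozenset[str]  # the reader must hold every label listed here


def scan(cells: list[Cell], authorizations: set[str]) -> list[Cell]:
    """Return only the cells whose required labels are all held by the reader."""
    return [cell for cell in cells if cell.required_labels <= authorizations]


if __name__ == "__main__":
    table = [
        Cell("patient1", "name", "Alice", frozenset()),
        Cell("patient1", "diagnosis", "flu", frozenset({"medical"})),
        Cell("patient1", "billing", "overdue", frozenset({"medical", "finance"})),
    ]
    for cell in scan(table, {"medical"}):
        print(cell.column, cell.value)
```

The point of the sketch is that the filter applies per cell rather than per table or per row, so two readers scanning the same row can legitimately see different columns.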
  • 17. Apache Hive & Pig
      • Abstraction of Hadoop’s Java API
          • Hive is SQL-based
          • Pig is more data-flow oriented
          • Eases analysis using MapReduce
  • 18. Cloudera Impala
      • SQL-based, but interactive response
      • Backed by HDFS or HBase
      • Allows for fast iteration/discovery
      • Not as fault-tolerant as MapReduce
  • 19. Apache Sqoop & Flume
      • Get your data in and out of HDFS
      • Sqoop focuses on relational databases
      • Flume focuses on log files
  • 20. Cloudera Hue
      • Hadoop User Experience
      • Hadoop is largely command line
      • Hue provides a UI for end-users
      • SDK to build your own apps on top
  • 21. Apache Mahout
      • Machine learning algorithms that run on MapReduce
          • Clustering
          • Classification
          • Filtering
      • I didn’t study these algorithms in school
          • Data science people are excited
          • Math people are excited
          • I’m excited for them
  • 22. Apache Tika
      • Content analysis toolkit
      • Simply put, a lot of parsers
      • Detect/extract metadata/text from documents
          • HTML
          • XML
          • Office
          • PDF
          • mbox
          • More…
  • 23. Apache ZooKeeper
      • Distributed systems are HARD
          • Everyone was trying to implement the same subsystems
          • Bugs lead to race conditions, other bad things
      • ZK: Highly reliable distributed coordination services
          • Configuration
          • Naming
          • Synchronization
          • Group Services
  • 24. Apache Oozie
      • Workflow scheduling for Hadoop
      • Like cron, but in directed graph fashion
      • Out of box hooks:
          • MR
          • Pig
          • Hive
          • Sqoop
          • Impala
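"Like cron, but in directed graph fashion" means each action runs only after the actions it depends on have finished. The sketch below shows that idea with Python's standard `graphlib` module. It is not Oozie's workflow format, and the action names in `workflow` are invented for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_aggregate": {"pig_clean"},
    "mr_build_index": {"pig_clean"},
    "impala_report": {"hive_aggregate", "mr_build_index"},
}


def execution_order(dag: dict[str, set[str]]) -> list[str]:
    """Return one valid run order in which every action follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    for step in execution_order(workflow):
        print("run", step)
```

Here `hive_aggregate` and `mr_build_index` do not depend on each other, so a scheduler would be free to run them in parallel; a plain cron table cannot express that relationship.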
  • 25. Sentry (incubating)
      • Role-based access control for Hive/Impala/Solr
      • Regulatory/compliance assurance
  • 26. Cloudera Morphlines
      • In-memory transformations
          • Load, parse, transform, process
          • Records as name-value pairs w/ optional blob/pojo objects
          • Java library, embedded in your codebase
          • Used to ETL data from Flume and MR into Solr
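The "records as name-value pairs" idea can be illustrated with a small chain of in-memory transformations. Morphlines itself is a Java library, so the Python below is only an analogy: `Record`, `Command`, `run_pipeline`, and the three commands are hypothetical names, and the "LEVEL message" log format is assumed.

```python
from typing import Callable, Optional

Record = dict[str, object]  # a record is just a bag of name-value pairs
Command = Callable[[Record], Optional[Record]]


def parse_log_line(record: Record) -> Optional[Record]:
    """Split a raw 'LEVEL message' line into two named fields."""
    level, _, message = str(record["raw"]).partition(" ")
    if not message:
        return None  # drop records that do not match the expected shape
    return {**record, "level": level, "message": message}


def lowercase_message(record: Record) -> Optional[Record]:
    """Normalize the message field for indexing."""
    return {**record, "message": str(record["message"]).lower()}


def drop_raw(record: Record) -> Optional[Record]:
    """Remove the original line once it has been parsed."""
    return {name: value for name, value in record.items() if name != "raw"}


def run_pipeline(record: Record, commands: list[Command]) -> Optional[Record]:
    """Pass one record through each command in turn; None means it was dropped."""
    for command in commands:
        result = command(record)
        if result is None:
            return None
        record = result
    return record


if __name__ == "__main__":
    steps = [parse_log_line, lowercase_message, drop_raw]
    print(run_pipeline({"raw": "ERROR Disk Full"}, steps))
```

In an indexing setup, the record that comes out of the last command is what would be handed to the search index.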
  • 27. Apache Lucene
      • Java-based index and search
          • Spellchecking
          • Hit highlighting
          • Tokenization
  • 28. Apache Solr
      • Enterprise search platform
          • Based on Apache Lucene
          • Full-text search
          • Faceting
          • NRT indexing
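Tokenization, full-text search, and faceting can be shown together in a toy inverted index. This is a sketch of the general technique only: it does not reflect how Lucene or Solr are implemented, and `TinyIndex`, `tokenize`, and the sample documents are all made up for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index: token -> ids of the documents containing it."""

    def __init__(self) -> None:
        self.postings: dict[str, set[int]] = defaultdict(set)
        self.docs: dict[int, dict[str, str]] = {}

    def add(self, doc_id: int, text: str, **fields: str) -> None:
        """Store the document and index every token of its text."""
        self.docs[doc_id] = {"text": text, **fields}
        for token in tokenize(text):
            self.postings[token].add(doc_id)

    def search(self, query: str) -> set[int]:
        """Return ids of documents containing every token of the query."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        hits = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            hits &= self.postings.get(token, set())
        return hits

    def facet(self, doc_ids: set[int], field: str) -> Counter:
        """Count the values of one stored field across a result set."""
        return Counter(
            self.docs[d][field] for d in doc_ids if field in self.docs[d]
        )


if __name__ == "__main__":
    index = TinyIndex()
    index.add(1, "Hadoop cluster logs", source="tickets")
    index.add(2, "Hadoop search index", source="mailing-list")
    index.add(3, "Solr search tips", source="mailing-list")
    hits = index.search("search")
    print(sorted(hits), index.facet(hits, "source"))
```

Facet counts are what let a user narrow a result set by clicking a category instead of writing another query.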
  • 29. Apache SolrCloud
      • Integration of Solr + ZooKeeper
      • Provides for shard failover
  • 30. Cloudera Search
      • Based on Apache Solr (incl Lucene and SolrCloud)
      • Fault-tolerance: collections backed by HDFS or HBase
      • Integration galore:
          • HBase/Flume/MapReduce w/ Lucene
          • Hue w/ Solr
          • Avro w/ Tika
          • HDFS w/ Solr/Lucene
          • Sentry w/ Solr
  • 31. Cloudera Search + Hue
  • 32. Cloudera Search + Hue
  • 33. Why Search?
      Apologies, I swiped some pretty slides from marketing…
  • 34. Search Design Strategy
      An Integrated Part of the Hadoop System
      [Diagram] Engines: Batch Processing (MapReduce, Hive & Pig); Interactive SQL (Cloudera Impala); Interactive Search (Cloudera Search); Machine Learning (Mahout); Math & Statistics (SAS, R); …
      Shared layers: Resource Management; Metadata; Storage (HDFS: text, RCFile, Parquet, Avro, etc.; HBase: records); Integration
      One pool of data. One security framework. One set of system resources. One management interface.
  • 35. Benefits of Search Integration
      • Improved Big Data ROI
          • An interactive experience without technical knowledge
          • Single data set for multiple computing frameworks
      • Faster Time to Insight
          • Exploratory analysis, esp. unstructured data
          • Broad range of indexing options to accommodate needs
      • Cost Efficiency
          • Single scalable platform; no incremental investment
          • No need for separate systems, storage
      • Solid Foundations & Reliability
          • Solr in production environments for years
          • Hadoop-powered reliability and scalability
  • 36. Making Decisions
      So much software…
  • 37. That’s a Lot of Software
      • 21 packages, depending on how you count
          • And there’s plenty more…
      • How to decide what to use?
  • 38. “The answer to most Hadoop questions is it depends.”
  • 39. Some of the Big Issues
      • Response time
      • User interfaces
      • Programming paradigm
      • Input/output formats
      • Use cases
  • 40. Response Time
      • MapReduce is batch oriented
          • Resilient to hardware failures
          • Robust scheduling options
      • Impala is near-realtime
      • HBase is realtime
          • Key/values are cached in memory
      • Search can be (near-)realtime.
      • Hybrid systems are common!
  • 41. User Interfaces
      • Java
          • MapReduce, HBase
      • SQL
          • Hive, Impala
      • Shell
          • Pig
      • Natural Language / Free Text
          • Search
  • 42. Data Constraints
      • MapReduce
          • Paradigm takes some getting used to
          • Processing must accommodate format
      • HBase
          • Columnar key/value store
          • Hue makes this easier
      • Search
          • Indexing and display
          • Hue makes this easier
  • 43. Input/Output Formats
      • Know what they are… optional.
          • Don’t know? That’s okay.
          • Schema on read.
          • Be able to extract what you need
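"Schema on read" means the raw data is stored untouched and a schema is applied only when someone reads it, so different readers can extract different fields from the same bytes. A minimal Python sketch, assuming a comma-separated log format; the sample data in `RAW` and the helper `read_with_schema` are invented for illustration.

```python
import csv
import io
from typing import Iterator

# Raw data as it landed in storage; nothing was validated at write time.
RAW = (
    "2013-11-01,web01,GET /index.html,200\n"
    "2013-11-01,web02,GET /missing,404\n"
    "malformed line\n"
)


def read_with_schema(
    raw: str, fields: list[str], wanted: list[str]
) -> Iterator[dict[str, str]]:
    """Apply a schema while reading and keep only the fields the caller wants."""
    for row in csv.reader(io.StringIO(raw)):
        if len(row) != len(fields):
            continue  # skip rows that do not fit this reader's schema
        record = dict(zip(fields, row))
        yield {name: record[name] for name in wanted}


if __name__ == "__main__":
    schema = ["date", "host", "request", "status"]
    for record in read_with_schema(RAW, schema, ["host", "status"]):
        print(record)
```

A second reader could pass a different `wanted` list, or a different schema entirely, without anyone rewriting the stored data.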
  • 44. Lack of Use Case
      • “Big Data” and Hadoop
          • They ENABLE you to solve problems
          • Won’t solve problems for you
          • Doesn’t know about your business logic
          • “Big” is bigger than you’re accustomed to…
      • Have a plan
          • Bring your use cases
          • Bring your business questions
  • 45. Index Generation/Serving
      One typical Hadoop use case.
  • 46. eBay – Cassini Project
      • June 2012
          • 2B page views/day
          • 250M searches/day
          • 9 PB online
      • Custom search indexes
          • Limited by field or time period
  • 47. eBay – Cassini Project
      • MapReduce to generate indexes
          • Customer history
          • Item fields: name, price, descriptions, etc.
      • Bulk import indexes into HBase, served
          • 15 TB in HBase, 1.2 TB daily import into HBase
      • Ranking algorithms can take into account
          • More history
          • More fields
          • More customer-specific details
  • 48. Search Use Cases
      Some quick examples.
  • 49. Search Use Cases
      Powerful, proven search capabilities that let organizations:
      • Offer easy access to non-technical resources
      • Explore data prior to processing and modeling
      • Gain immediate access and find correlations in mission-critical data
  • 50. Monsanto
      Scalable, efficient image search for analysis and research
      • Track plant characteristics throughout their lifecycle
      • Before: Manual attribute extraction and search queries within database
      • Now: Parse and index images at acquisition and on demand, index archived images in batch
  • 51. Custom Aggregated Search
      Cloudera: Internal Field Portal
  • 52. Cloudera – Internal Field Portal
      • Single stop for field engineers
          • Mailing lists: public, private
          • Tickets: support, development, public ASF
          • Customer data: accounts, clusters, KB articles
          • Customer Clusters: configs, audits, logs, events
          • Books and papers
          • Discussion forums
      • Dogfooding, yes
          • Makes my life easier
  • 53. Cloudera – Internal Field Portal
  • 54. Cloudera – Internal Field Portal
      • Varied fetchers/observers for web/API content
      • Content is retrieved via Flume, Sqoop
      • Search indexes and replicates into HBase
          • Each collection has collection-specific filters/fields
          • Provides title, content snippet, link to original
      • Morphlines extracts books and papers using Tika
      • Impala for analytics
      • Future: Use MapReduce to ingest logs
  • 55. Risk Classification & Predictive Analysis
      Patterns & Predictions: Durkheim Project
  • 56. 2012
      US Combat Deaths AFG: 301
      Image: http://www.flickr.com/photos/soldiersmediacenter/4598169027/
  • 57. 2012
      US Combat Deaths AFG: 301
      US Military Suicides: 349
      Image: http://www.flickr.com/photos/soldiersmediacenter/4598169027/
  • 58. 2012
      US Combat Deaths AFG: 301
      US Military Suicides: 349
      349 > 301
      Image: http://www.flickr.com/photos/soldiersmediacenter/4598169027/
  • 59. Patterns & Predictions – Durkheim Project
      • Assessment of mental health risks
      • Correlate veterans’ communications with suicide risk
  • 60. Patterns & Predictions – Durkheim Project
      • Build machine learning algorithms on MapReduce
      • Train using expert knowledge
          • Keywords
          • Patterns
      • Algorithm detects and assigns risk scores
      • In what medium?
  • 61. Patterns & Predictions – Durkheim Project
      Unstructured Clinical Notes
      Image: http://www.flickr.com/photos/42586873@N00/3770782889/
  • 62. Patterns & Predictions – Durkheim Project
      • Phase 1
          • 3 cohorts: non-psychiatric, psychiatric, suicide-positive
          • 100 clinical profiles per cohort
          • 65% accurate in predicting suicide risk in control group
      • Phase 2
          • Text analytics of clinical records, opt-in social media
          • Goal of 100,000 veteran participants
          • Represents a huge increase of data
              • Traditional enterprise search couldn’t scale
  • 63. Patterns & Predictions – Durkheim Project
      • Technologies
          • Hadoop
          • Search
              • Indexing of machine learning, backed by HBase for performance
              • Hue interface for non-technical users
              • Discovery of terms, keywords, risk factors in numerous facets
          • Impala
              • Deep SQL queries if/when interesting deviations are found
              • e.g. if the word “Molly” appeared in top 10 facets
              • Write some SQL to dig in, perhaps revise indexing scheme
  • 64. Patterns & Predictions – Durkheim Project
      • Currently
          • Monitoring
          • Analysis
      • Future
          • Interventional study
          • Back our hopes with data…
      • More detailed Case Study
          • http://goo.gl/3ZJMwS
          • http://durkheimproject.org/
  • 65. Summary
      Parting thoughts… in no particular order.
  • 66. Search Simplifies Interaction
      Explore. Navigate. Correlate.
      Experts know MapReduce. Savvy people know SQL. Everyone knows Search.
  • 67. Summary
      • With Hadoop, it depends.
      • The tools are out there.
      • Open source software
          • Many interconnected pieces
          • Many unexplored opportunities
          • A thriving community awaits you…
      • Data can make a difference.
      • Search allows everyone to interact with data.
          • This is a Big Deal.
  • 68. What’s Next?
      • Download Hadoop!
          • CDH available at www.cloudera.com
          • Cloudera provides pre-loaded VMs
              • http://tiny.cloudera.com/quickstartvm
      • Clone our repos!
          • https://github.com/cloudera
      • Already done that? Contribute…
  • 69. Questions?
      Preferably related to the talk…
  • 70. Thank You!
      Alex Moundalexis, alexm@clouderagovt.com, @technmsg
      We’re hiring, kids! Well, not kids.