Big Data App Server by Lance Riedel, CTO, The Hive for The Hive India event

Published in Technology
  • 1. Big Data App Server
      Lance Riedel
  • 2. Big Data App Server
      A new application framework for the 4 V’s:
      • Volume of raw data (petabytes)
      • Velocity at which it is being generated/ingested
      • Variety of data sources and schemas
      • Advanced data sciences and analytics that can be applied to extract Value
  • 3. Big Data App Server Use Cases
      • Log/Machine Analytics
      • Security/Fraud Detection
      • Sensor Data Analytics
      • Financial Analytics
      • Retail Analytics
      • Ad Targeting
      • Recommendation (e.g. Netflix, Amazon)
  • 4. Components: Big Data Platform
  • 6. Storage and Compute: Big Data Platform
  • 7. Storage and Compute
      Motivation
      Google needed to capture the web and process it efficiently
      • Calculate importance of pages, words, domains against each other
      • The more cost-effective they could make it, the more they could process, index, and understand
  • 8. Storage/Compute: Centralized
      • Centralized doesn’t scale!
      • Moving a lot of data to the compute becomes the bottleneck
  • 9. Storage/Compute: Sharding
      • Sharding is splitting the problem into isolated chunks
      • Sharding scales, but fails when you need to look across the data
      • E.g. how to calculate term weights or top pages across shards?
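
A minimal Python sketch of why isolated shards cannot answer a cross-corpus question on their own. The shard contents and term names are invented for illustration. Each shard's local top term differs from the global top term, which only appears once the counts are combined.

```python
from collections import Counter

# Hypothetical per-shard term counts (illustrative data only).
shard_a = Counter({"hadoop": 5, "data": 4})
shard_b = Counter({"flume": 5, "data": 4})

def local_top(shard):
    """Top term as seen by a single isolated shard."""
    return shard.most_common(1)[0][0]

def global_top(shards):
    """Top term across all shards; requires combining every shard's counts."""
    total = Counter()
    for shard in shards:
        total.update(shard)  # Counter.update adds counts together
    return total.most_common(1)[0][0]

if __name__ == "__main__":
    print(local_top(shard_a), local_top(shard_b), global_top([shard_a, shard_b]))
```

Neither shard ranks "data" first, yet it is the most frequent term overall, so some cross-shard aggregation step is unavoidable.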
  • 10. DFS, MapReduce
      • Used a new programming model to distribute computation AND data (NOT sharding)
      • Runs on commodity hardware
      • Failure resilience using software control
      • Easy to calculate across corpus
      • Two parts of a complete solution:
          • Distributed File System – DFS
          • MapReduce
  • 11. Distributed File System
  • 12. MapReduce
      • Process where the data resides (data and compute are local to each other)
      • Map (read the data, emit a key and a value)
      • Reduce (group all values per key, perform another operation)
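
The map and reduce steps above can be sketched in plain Python as a word count. This is a single-process illustration, not Hadoop's API. The function names (`map_phase`, `shuffle`, `reduce_phase`) are made up for this example, and a real framework would run the grouping step across machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: read one record and emit (key, value) pairs."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all emitted values by key (the framework does this in a real system)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: perform an operation over all values collected for one key."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

if __name__ == "__main__":
    print(word_count(["big data app server", "big data platform"]))
```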
  • 13. Hadoop
      • Open source implementation of Google’s DFS and MapReduce whitepapers
      • Huge ecosystem
      • Used by: Yahoo, Facebook, Twitter, LinkedIn, Sears, Apple, The New York Times, Telefonica, and thousands more!
  • 14. Management: Big Data Platform
  • 15. Data Ingestion
      Motivation
      • Data originating from a variety of sources
      • Some data more valuable than others:
          • Time-to-live (TTL)
          • Guarantees on delivery
  • 16. Data Ingestion: Apache Flume
      • A scalable, fault-tolerant data ingestion pipeline with a configurable topology that works hand in hand with the Hadoop ecosystem
      • Configurable delivery guarantees: routing, replication, failover
      • Extensible sources and sinks allow for pluggable data sources
      • Scales out horizontally – hundreds of thousands of messages/sec
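
To make the replication and failover delivery modes concrete, here is a small Python sketch. It does not use Flume or mirror its configuration format. The `Sink` class and both functions are hypothetical stand-ins that only illustrate the two behaviours.

```python
class Sink:
    """Hypothetical destination for events (not a Flume class)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.received = []

    def deliver(self, event):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.received.append(event)

def replicate(event, sinks):
    """Replication: every sink receives its own copy of the event."""
    for sink in sinks:
        sink.deliver(event)

def failover(event, sinks):
    """Failover: try sinks in priority order and stop at the first success."""
    for sink in sinks:
        try:
            sink.deliver(event)
            return sink.name
        except ConnectionError:
            continue
    raise RuntimeError("no sink accepted the event")

if __name__ == "__main__":
    primary, backup = Sink("primary", healthy=False), Sink("backup")
    print(failover("login-event", [primary, backup]))
```

With the primary sink marked unhealthy, the event lands on the backup, while `replicate` would instead raise if any target is down.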
  • 17. Workflow
      Motivation
      Transforming, storing, and joining data can take many steps that need to be repeatable and traceable: the programming model for data
  • 18. Workflow: Oozie
      A workflow engine that understands the dependency graph of work and can schedule, replay, and report on the steps
      • Jobs triggered by time (frequency) and data availability
      • Integrated with the rest of the Hadoop stack
      • Scalable, reliable and extensible system
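
The idea of running steps in dependency order can be sketched with Python's standard `graphlib` module (Python 3.9+). This is not Oozie's workflow definition format. The step names and the `run_workflow` helper are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

def run_workflow(dependencies, actions):
    """Run each step only after all the steps it depends on have run.

    dependencies maps a step name to the set of steps it depends on.
    actions maps a step name to a zero-argument callable.
    Returns the order in which steps ran, as a simple execution report.
    """
    executed = []
    for step in TopologicalSorter(dependencies).static_order():
        actions[step]()
        executed.append(step)
    return executed

if __name__ == "__main__":
    # Hypothetical pipeline: ingest -> clean -> join -> report.
    deps = {"ingest": set(), "clean": {"ingest"}, "join": {"clean"}, "report": {"join"}}
    acts = {name: (lambda: None) for name in deps}
    print(run_workflow(deps, acts))
```

Because the returned list records what ran and in which order, the same structure can back a report or a replay of the workflow.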
  • 19. Schema Management
      Motivation
      As data sources explode, the need to understand the data schemas becomes a principal concern
  • 20. Schema: HCatalog
      • A table and storage management layer for Hadoop
      • Enables users with different data processing tools – Pig, MapReduce, and Hive – to more easily read and write data on the grid
  • 21. Schema: Avro
      • A data serialization system
      • When Avro data is stored in a file, its schema is stored with it
      • Correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved
      • Most technologies in the Hadoop stack understand Avro, which gives interoperability and easy data passing
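
The field-resolution behaviour described above can be sketched in Python with plain dictionaries. This does not use the Avro library. It only mimics the rules that fields are matched by name, extra written fields are ignored, and fields missing from the written data fall back to the reader's default. The `resolve` function and the record layout are hypothetical.

```python
NO_DEFAULT = object()  # marker for reader fields that have no default value

def resolve(written_record, reader_fields):
    """Project a written record onto the reader's view of the schema.

    reader_fields maps each field name the reader expects to its default,
    or to NO_DEFAULT when the field is required.
    """
    resolved = {}
    for name, default in reader_fields.items():
        if name in written_record:
            resolved[name] = written_record[name]  # same-named field: copy
        elif default is not NO_DEFAULT:
            resolved[name] = default               # missing field: use default
        else:
            raise ValueError(f"missing field with no default: {name}")
    return resolved                                # extra written fields are dropped

if __name__ == "__main__":
    written = {"id": 1, "host": "web-1", "legacy_flag": True}
    reader = {"id": NO_DEFAULT, "host": NO_DEFAULT, "region": "unknown"}
    print(resolve(written, reader))
```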
  • 22. Data Access, Querying: Big Data Platform
  • 23. Data Access
      Motivation
      Various data access patterns require data stores beyond just the DFS files. An example is a key-value store that needs random access to data.
      Solution(s)
      There are a number of solutions depending on the use case.
      • Google’s BigTable whitepaper
      • SQL has been adapted to Hadoop
  • 24. Data Access: HBase
      • The Hadoop database: a distributed, scalable, big data store (sorted map), modeled on Google’s BigTable and backed by Hadoop DFS
      • Linear and modular scalability
      • Automatic and configurable sharding of tables
      • Automatic failover support
      • Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables
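
The "sorted map" data model can be sketched in a few lines of Python. This is an in-memory toy, not the HBase client API. The `SortedMap` class and the row-key format are assumptions. It shows why keeping keys sorted makes both random access and range scans cheap.

```python
import bisect

class SortedMap:
    """Toy key-value store that keeps row keys in sorted order."""

    def __init__(self):
        self._keys = []    # row keys, always sorted
        self._values = {}  # row key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        """Random access by row key."""
        return self._values.get(key)

    def scan(self, start, stop):
        """Return rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(key, self._values[key]) for key in self._keys[lo:hi]]

if __name__ == "__main__":
    table = SortedMap()
    table.put("user#002", "bob")
    table.put("order#001", "book")
    table.put("user#001", "ann")
    print(table.scan("user#", "user$"))  # all rows whose key starts with "user#"
```

Choosing row keys with a shared prefix keeps related rows adjacent, so one range scan retrieves them together.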
  • 25. Data Access: SQL – Hive, Impala
      • SQL querying of raw data on the distributed file system
      • Impala – query files on HDFS, including SELECT, JOIN, and aggregate functions, in real time
      • Hive – provides easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems
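
As an illustration of the kind of query meant here (SELECT with a JOIN and an aggregate), the sketch below runs one against Python's built-in `sqlite3` module. SQLite is only a stand-in so the example is runnable. It is neither Hive nor Impala, and the `pages` and `views` tables are invented.

```python
import sqlite3

def build_demo_db():
    """Create a small in-memory database with made-up page view data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE views (page_id INTEGER, viewer TEXT);
        INSERT INTO pages VALUES (1, 'home'), (2, 'pricing');
        INSERT INTO views VALUES (1, 'ann'), (1, 'bob'), (2, 'ann');
    """)
    return conn

def views_per_page(conn):
    """SELECT with a JOIN and an aggregate: count views for each page title."""
    query = """
        SELECT p.title, COUNT(*) AS hits
        FROM views AS v
        JOIN pages AS p ON p.id = v.page_id
        GROUP BY p.title
        ORDER BY hits DESC, p.title
    """
    return conn.execute(query).fetchall()

if __name__ == "__main__":
    print(views_per_page(build_demo_db()))
```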
  • 26. Analytics: Big Data Platform
  • 27. Data Analytics
      Motivation
      • Discover the latent value of the data. The core motivation behind Big Data!
      • Clustering, machine learning, correlations, modeling – the guts of data science – often with extremely diverse use cases
      Solution(s)
      A pluggable architecture that can share schemas, but allow for a suite of tools appropriate for the use case
  • 28. Data Analytics: Example Frameworks
      • Mahout: machine learning, clustering
      • Pattern: machine learning DSL for Hadoop, from Cascading
      • 0xData: open source math and prediction engine for big data
      • Sample algorithms:
          • Random Forest algorithm
          • K-Means Clustering
          • Hierarchical Clustering
          • Linear Regression
          • Logistic Regression
          • Support Vector Machines
          • Artificial Neural Networks
          • Association Rule Learning
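
As one concrete instance from the list of sample algorithms, here is a minimal K-Means sketch in plain Python. It is a single-machine illustration with caller-supplied starting centroids, not the Mahout or 0xData implementation, and the sample points are made up.

```python
def kmeans(points, centroids, iterations=10):
    """Basic K-Means: assign points to the nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
                )
            else:
                new_centroids.append(centroids[i])  # keep centroids of empty clusters
        centroids = new_centroids
    return centroids, clusters

if __name__ == "__main__":
    data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
    print(kmeans(data, [(0, 0), (10, 10)]))
```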
  • 29. Serving: Big Data Platform
  • 30. Serving
      Motivation
      • Powering applications for end users
      • Search/browse and recommendation engines allow real-time access to data
  • 31. Serving: Search – Solr Cloud
      • Builds indexes on top of Hadoop
      • Horizontally scalable, fault tolerant
      • Incredible flexibility in indexing options
          • Tokenization
          • Field types
          • Data storage
      • Search options just as flexible
          • AND, OR, NOT, wildcard
          • Facets (counts from a derived ontology)
          • Extensive algorithm and weighting pluggability
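
The indexing and search options above can be illustrated with a toy inverted index in Python. This is not Solr's API. The `TinyIndex` class, its whitespace tokenizer, and the sample documents are assumptions chosen to show how AND, OR, NOT, and facet counts fall out of set operations on postings.

```python
from collections import Counter, defaultdict

class TinyIndex:
    """Toy inverted index: token -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.categories = {}  # doc id -> category, used for facet counts

    def add(self, doc_id, text, category):
        self.categories[doc_id] = category
        for token in text.lower().split():  # simplest possible tokenization
            self.postings[token].add(doc_id)

    def term(self, token):
        """Ids of documents that contain the token."""
        return set(self.postings.get(token.lower(), set()))

    def facets(self, doc_ids):
        """Count matching documents per category."""
        return Counter(self.categories[doc_id] for doc_id in doc_ids)

if __name__ == "__main__":
    index = TinyIndex()
    index.add(1, "hadoop log analytics", "logs")
    index.add(2, "hadoop fraud detection", "security")
    index.add(3, "sensor log analytics", "logs")
    print(index.term("hadoop") & index.term("log"))     # AND
    print(index.term("hadoop") | index.term("sensor"))  # OR
    print(index.term("log") - index.term("hadoop"))     # NOT
    print(index.facets(index.term("analytics")))
```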
  • 32. Serving: Manas – Matching Engine
      • The Hive’s massively scalable matching engine
      • Handles hundreds of millions to billions of documents efficiently while matching against hundreds to thousands of features
      • Nothing exists today in the open source community that has these capabilities
  • 33. EXAMPLE APP USE-CASE
  • 34. App Server Data Flow
  • 35. SecurityX on App Server