Big Data App Server by Lance Riedel, CTO, The Hive, for The Hive India event

  1. Big Data App Server
     Lance Riedel

  2. Big Data App Server
     A new application framework for the 4 V's:
     • Volume of raw data (petabytes)
     • Velocity at which it is being generated/ingested
     • Variety of data sources and schemas
     • Advanced data sciences and analytics that can be applied to extract Value

  3. Big Data App Server Use Cases
     • Log/Machine Analytics
     • Security/Fraud Detection
     • Sensor Data Analytics
     • Financial Analytics
     • Retail Analytics
     • Ad Targeting
     • Recommendation (e.g. Netflix, Amazon)

  4. Components: Big Data Platform

  5. App Server Components

  6. Storage and Compute: Big Data Platform

  7. Storage and Compute
     Motivation
     Google needed to capture the web and process it efficiently
     • Calculate importance of pages, words, domains against each other
     • The more cost-effective they could make it, the more they could process, index, understand

  8. Storage/Compute: Centralized
     • Centralized doesn't scale!
     • Moving a lot of data creates a bottleneck

  9. Storage/Compute: Sharding
     • Sharding is splitting the problem into isolated chunks
     • Sharding scales, but fails when you need to look across the data
     • E.g. how to calculate term weights or top pages across shards?
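The cross-shard problem on the sharding slide can be illustrated with a small sketch. The shard contents and the function names `naive_top1` and `global_top1` are hypothetical, chosen only for illustration: each shard picks its own top page, but the true global winner is a page that is second on every shard.

```python
from collections import Counter

# Hypothetical per-shard link counts; not data from the talk.
shard_a = Counter({"x.com": 5, "y.com": 4})
shard_b = Counter({"z.com": 5, "y.com": 4})


def naive_top1(shards):
    """Take each shard's local winner, then pick the best of those."""
    local_winners = [shard.most_common(1)[0] for shard in shards]
    return max(local_winners, key=lambda pair: pair[1])[0]


def global_top1(shards):
    """Sum counts across all shards before ranking."""
    total = Counter()
    for shard in shards:
        total.update(shard)  # Counter.update adds counts per key
    return total.most_common(1)[0][0]


shards = [shard_a, shard_b]
print(naive_top1(shards))   # a local winner, which misses y.com
print(global_top1(shards))  # y.com, with 8 links in total
```

The naive version never sees that `y.com` has the most links overall, which is the kind of "look across the data" question that isolated shards cannot answer on their own.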
  10. DFS, MapReduce
     • Used a new programming model to distribute computation AND data (NOT sharding)
     • Runs on commodity hardware
     • Failure resilience using software control
     • Easy to calculate across corpus
     • Two parts of a complete solution:
       • Distributed File System (DFS)
       • MapReduce

  11. Distributed File System

  12. MapReduce
     • Process where the data resides (data and compute are local to each other)
     • Map (read the data, emit a key and a value)
     • Reduce (group all values per key, perform another operation)
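The map and reduce steps above can be sketched in a single process as a word count. This is a minimal illustration, not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are assumptions, and a real cluster would run the map step on the nodes that hold each block of data.

```python
from collections import defaultdict


def map_phase(document):
    """Map: read the data, emit a (key, value) pair per word."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key (done by the framework on a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key, here by summing."""
    return key, sum(values)


def run_job(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = run_job(["the web is big", "the web grows"])
print(counts)
```

Because each reduce call sees every value for its key, totals are computed across the whole corpus rather than per chunk.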
  13. Hadoop
     • Open source implementation of Google's DFS and MapReduce white papers
     • Huge ecosystem
     • Used by: Yahoo, Facebook, Twitter, LinkedIn, Sears, Apple, The New York Times, Telefonica, and thousands more

  14. Management: Big Data Platform

  15. Data Ingestion
     Motivation
     • Data originating from a variety of sources
     • Some data more valuable than others:
       • Time-to-live (TTL)
       • Guarantees on delivery

  16. Data Ingestion: Apache Flume
     • A scalable, fault-tolerant data ingestion pipeline with a configurable topology that works hand in hand with the Hadoop ecosystem
     • Configurable delivery guarantees: routing, replication, failover
     • Extensible sources and sinks allow for pluggable data sources
     • Scales out horizontally: hundreds of thousands of messages/sec

  17. Workflow
     Motivation
     Transforming, storing, and joining data can take many steps that need to be repeatable and traceable: the programming model for data

  18. Workflow: Oozie
     A workflow engine that understands the dependency graph of work and can schedule, replay, and report on the steps
     • Jobs triggered by time (frequency) and data availability
     • Integrated with the rest of the Hadoop stack
     • Scalable, reliable, and extensible system
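The idea of scheduling and replaying steps from a dependency graph can be sketched as follows. This is not Oozie's interface; the step names and the helpers `run_order` and `replay_from` are hypothetical, and the sketch assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "join": {"clean"},
    "report": {"join", "clean"},
}


def run_order(graph):
    """Return an order in which every step runs after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


def replay_from(graph, failed_step):
    """Return the failed step plus everything downstream of it, in run order."""
    order = run_order(graph)
    dirty = {failed_step}
    for step in order:
        if graph[step] & dirty:  # depends on something that must re-run
            dirty.add(step)
    return [step for step in order if step in dirty]


print(run_order(workflow))
print(replay_from(workflow, "clean"))
```

Replaying from `clean` re-runs `clean`, `join`, and `report` but leaves `ingest` alone, which is the kind of bookkeeping a workflow engine takes over from hand-written scripts.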
  19. Schema Management
     Motivation
     As data sources explode, the need to understand the data schemas becomes a principal concern

  20. Schema: HCatalog
     • A table and storage management layer for Hadoop
     • Enables users with different data processing tools (Pig, MapReduce, and Hive) to more easily read and write data on the grid

  21. Schema: Avro
     • A data serialization system
     • When Avro data is stored in a file, its schema is stored with it
     • Correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved
     • Most technologies in the Hadoop stack understand Avro: interoperability/data passing
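The field-resolution rules on the Avro slide can be sketched in plain Python. This does not use the Avro library; the `resolve` helper and the simplified reader-schema dictionary are assumptions for illustration. It follows the general rule that same-named fields are copied, extra fields written by the producer are ignored, and fields missing from the record take the reader's default when one exists.

```python
def resolve(record, reader_fields):
    """Project a written record onto the fields the reader expects."""
    resolved = {}
    for name, spec in reader_fields.items():
        if name in record:
            resolved[name] = record[name]      # same-named field: copy
        elif "default" in spec:
            resolved[name] = spec["default"]   # missing field: use default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved                            # extra written fields dropped


# Hypothetical reader schema and an older record from a different writer.
reader_fields = {
    "user_id": {"type": "long"},
    "country": {"type": "string", "default": "unknown"},
}
written = {"user_id": 42, "legacy_flag": True}

print(resolve(written, reader_fields))
```

Keeping the writer's schema with the data is what makes this reconciliation possible without coordinating every producer and consumer.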
  22. Data Access, Querying: Big Data Platform

  23. Data Access
     Motivation
     Various data access patterns require data stores beyond just the DFS files. An example is a key-value store that needs random access to data.
     Solution(s)
     There are a number of solutions depending on the use case.
     • Google's BigTable white paper
     • SQL has been adapted to Hadoop

  24. Data Access: HBase
     • The Hadoop database: a distributed, scalable, big data store (sorted map), modeled on Google's BigTable and backed by the Hadoop DFS
     • Linear and modular scalability
     • Automatic and configurable sharding of tables
     • Automatic failover support
     • Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables
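The "sorted map" model mentioned above can be sketched with a toy in-memory class. This is not the HBase client API; `SortedMap` and its methods are hypothetical, and the sketch only shows why keeping row keys sorted gives both random access by key and cheap range scans.

```python
import bisect


class SortedMap:
    """Toy sorted key-value store: point lookups plus ordered range scans."""

    def __init__(self):
        self._keys = []     # row keys kept in sorted order
        self._values = {}   # row key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def scan(self, start, stop):
        """Return (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


table = SortedMap()
for row_key, value in [("user2#b", 3), ("user1#a", 1), ("user1#b", 2), ("user3#a", 4)]:
    table.put(row_key, value)

print(table.get("user1#b"))
print(table.scan("user1", "user2"))
```

Designing row keys so that related rows sort next to each other (here, a shared `user1` prefix) lets a single scan fetch them together.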
  25. Data Access: SQL (Hive, Impala)
     • SQL querying of raw data on the distributed file system
     • Impala: query files on HDFS, including SELECT, JOIN, and aggregate functions, in real time
     • Hive: provides easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems

  26. Analytics: Big Data Platform

  27. Data Analytics
     Motivation
     • Discover the latent value of the data. The core motivation behind Big Data!
     • Clustering, machine learning, correlations, modeling: the guts of data science, often with extremely diverse use cases
     Solution(s)
     A pluggable architecture that can share schemas, but allow for a suite of tools appropriate for the use case

  28. Data Analytics: Example Frameworks
     • Mahout
       • Machine learning, clustering
     • Pattern: Machine Learning DSL for Hadoop from Cascading
     • 0xData
       • Open source math and prediction engine for big data
     • Sample algorithms
       • Random Forest algorithm
       • K-Means Clustering
       • Hierarchical Clustering
       • Linear Regression
       • Logistic Regression
       • Support Vector Machines
       • Artificial Neural Networks
       • Association Rule Learning

  29. Serving: Big Data Platform

  30. Serving
     Motivation
     • Powering applications for end users
     • Search/browse and recommendation engines allow real-time access to data

  31. Serving: Search (Solr Cloud)
     • Builds indexes on top of Hadoop
     • Horizontally scalable, fault tolerant
     • Incredible flexibility in indexing options
       • Tokenization
       • Field types
       • Data storage
     • Search options just as flexible
       • AND, OR, NOT, wildcard
       • Facets (counts from a derived ontology)
       • Extensive algorithm and weighting pluggability
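The indexing and search options listed above can be sketched with a tiny inverted index. This is not Solr's API; the documents, the `build_index` and `facet` helpers, and the whitespace tokenizer are assumptions made only to show how AND, OR, NOT, and facet counts fall out of the index structure.

```python
from collections import Counter, defaultdict

# Hypothetical documents: one text field and one field used for faceting.
docs = {
    1: {"text": "hadoop cluster storage", "type": "blog"},
    2: {"text": "hadoop search index", "type": "paper"},
    3: {"text": "search ranking", "type": "blog"},
}


def build_index(documents):
    """Tokenize on whitespace and map each token to the ids containing it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for token in doc["text"].lower().split():
            index[token].add(doc_id)
    return index


def facet(doc_ids, field):
    """Count the values of one field over a result set."""
    return Counter(docs[doc_id][field] for doc_id in doc_ids)


index = build_index(docs)
both = index["hadoop"] & index["search"]      # AND
either = index["hadoop"] | index["search"]    # OR
without = index["search"] - index["hadoop"]   # search NOT hadoop

print(both, either, without)
print(facet(either, "type"))
```

Boolean queries become set operations on posting lists, and a facet is a count over the matching documents.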
  32. Serving: Manas (Matching Engine)
     • The Hive's massively scalable matching engine
     • Handles hundreds of millions to billions of documents efficiently while matching against hundreds to thousands of features
     • Nothing exists today in the open source community that has these capabilities

  33. Example App Use Case

  34. App Server Data Flow

  35. SecurityX on App Server
