Big Data App Server
Lance Riedel

Big Data App Server
A new application framework for (4 V's):
•  Volume of raw data (Petabytes)
•  Velocity at which it is being generated/ingested
•  Variety of data sources and schemas
•  Advanced data sciences and analytics that can be applied to extract Value
  
	
  
Big Data App Server Use Cases
•  Log/Machine Analytics
•  Security/Fraud Detection
•  Sensor Data Analytics
•  Financial Analytics
•  Retail Analytics
•  Ad Targeting
•  Recommendation (e.g. Netflix, Amazon)
  
	
  
Components [diagram: Big Data Platform]
APP SERVER COMPONENTS
Storage and Compute [diagram: Big Data Platform]
Storage and Compute
Motivation
Google needed to capture the web and process it efficiently
•  Calculate importance of pages, words, domains against each other
•  The more cost-effective they could make it, the more they could process, index, understand
  
	
  
Storage/Compute: Centralized
•  Centralized doesn't scale!
•  Move a lot of data: bottleneck
  
Storage/Compute: Sharding
•  Sharding is splitting the problem into isolated chunks
•  Sharding scales, but fails when you need to look across the data
•  E.g. how to calculate term weights or top pages across shards?
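
The cross-shard problem can be sketched in a few lines. This is only an illustration: the shard contents and the helper names `local_top_term` and `global_top_term` are invented here, not taken from the talk. Each isolated shard reports a different "top term", and none of them matches the answer over the whole corpus.

```python
from collections import Counter

# Hypothetical data: each shard holds the terms of the pages assigned to it.
shards = [
    ["hadoop", "hadoop", "hive", "hive", "hive"],
    ["hadoop", "hadoop", "pig", "pig", "pig"],
    ["hadoop", "hadoop", "solr", "solr", "solr"],
]


def local_top_term(shard):
    """Most frequent term as seen by one isolated shard."""
    return Counter(shard).most_common(1)[0][0]


def global_top_term(all_shards):
    """Most frequent term once the counts from every shard are merged."""
    total = Counter()
    for shard in all_shards:
        total.update(shard)
    return total.most_common(1)[0][0]


local_answers = [local_top_term(s) for s in shards]
global_answer = global_top_term(shards)
print(local_answers, global_answer)
```

Getting the global answer requires a merge step over all shards, which is exactly the step plain sharding does not provide.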
  
DFS, MapReduce
•  Used a new programming model to distribute computation AND data (NOT sharding)
•  Runs on commodity hardware
•  Failure resilience using software control
•  Easy to calculate across corpus
•  Two parts of a complete Solution:
    •  Distributed File System (DFS)
    •  MapReduce
  
Distributed File System
MapReduce
•  Process where the data resides (Data and compute are local to each other)
•  Map (read the data, emit a key and a value)
•  Reduce (group all values per key, perform another operation)
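
A minimal single-process sketch of the Map and Reduce steps, using word count as the example. It assumes nothing about Hadoop's actual API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative, and in a real cluster the grouping step is done by the framework across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: read the data, emit a (key, value) pair per word."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key (the framework's job in a real system)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: perform another operation on all values of one key, here a sum."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data app server", "big data platform"])
print(counts)
```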
  
Hadoop
•  Open Source implementation of Google's DFS and MapReduce whitepaper
•  Huge Eco-System
•  Used by: Yahoo, Facebook, Twitter, LinkedIn, Sears, Apple, The New York Times, Telefonica, +1000's more!
  
Management [diagram: Big Data Platform]
Data Ingestion
Motivation
•  Data originating from a variety of sources
•  Some data more valuable than others:
    •  Time-to-live (TTL)
    •  Guarantees on delivery
  
Data Ingestion: Apache Flume
•  A scalable, fault-tolerant data ingestion pipeline with a configurable topology that works hand in hand with the Hadoop Eco-System
•  Configurable delivery guarantees
    -  routing, replication, failover
•  Extensible sources and sinks allow for pluggable data sources
•  Scales out horizontally: 100k's messages/sec
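
The failover and replication guarantees can be sketched conceptually as below. This is not Flume's API or configuration format: the `Sink` class, the sink names, and both delivery functions are hypothetical stand-ins meant only to show the two behaviours.

```python
class Sink:
    """Hypothetical sink that records events, or fails when marked unhealthy."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.delivered = []

    def write(self, event):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        self.delivered.append(event)


def deliver_with_failover(event, sinks):
    """Failover: try sinks in priority order, return the name of the one that accepted."""
    for sink in sinks:
        try:
            sink.write(event)
            return sink.name
        except ConnectionError:
            continue  # fall through to the next sink in the list
    raise RuntimeError("no sink accepted the event")


def deliver_replicated(event, sinks):
    """Replication: every sink receives its own copy of the event."""
    for sink in sinks:
        sink.write(event)


primary = Sink("hdfs-primary", healthy=False)
backup = Sink("hdfs-backup")
accepted_by = deliver_with_failover({"msg": "login ok"}, [primary, backup])

audit = Sink("audit")
deliver_replicated({"msg": "login ok"}, [backup, audit])
print(accepted_by, len(backup.delivered), len(audit.delivered))
```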
  
Workflow
Motivation
Transforming, storing, and joining data can take a lot of steps that need to be repeatable and traceable: the programming model for data
  
	
  
	
  
Workflow: Oozie
A workflow engine that understands the dependency graph of work and can schedule, replay, and report on the steps
•  Jobs triggered by time (frequency) and data availability
•  Integrated with the rest of the Hadoop stack
•  Scalable, reliable and extensible system.
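
What "understands the dependency graph" means can be sketched with Python's standard `graphlib` module (Python 3.9+). This is not Oozie's workflow definition format; the step names and the helpers `run_order` and `steps_to_replay` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "ingest_logs": set(),
    "clean": {"ingest_logs"},
    "join_users": {"clean"},
    "aggregate": {"clean"},
    "report": {"join_users", "aggregate"},
}


def run_order(graph):
    """Schedule: an order in which every step runs after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


def steps_to_replay(graph, failed_step):
    """Replay: the failed step plus everything downstream of it."""
    to_replay = {failed_step}
    changed = True
    while changed:
        changed = False
        for step, deps in graph.items():
            if step not in to_replay and deps & to_replay:
                to_replay.add(step)
                changed = True
    return to_replay


order = run_order(workflow)
replay = steps_to_replay(workflow, "join_users")
print(order, replay)
```

Only the failed step and its dependents are re-run; steps such as `aggregate` that do not depend on the failure keep their results.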
  
	
  	
  
	
  
	
  
	
  
Schema Management
Motivation
As data sources explode, the need to understand the data schemas becomes a principal concern
  
	
  
Schema: HCatalog
•  A table and storage management layer for Hadoop
•  Enables users with different data processing tools (Pig, MapReduce, and Hive) to more easily read and write data on the grid.
  	
  
	
  
	
  
	
  
	
  
Schema: Avro
	
  
•  A data serialization system
•  When Avro data is stored in a file, its schema is stored with it
•  Correspondence between same named fields, missing fields, extra fields, etc. can all be easily resolved.
•  Most technologies in the Hadoop stack understand Avro: interoperability/data passing
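
The field-resolution idea can be sketched in plain Python. This mirrors the concept only and does not use the Avro library; `resolve`, the `REQUIRED` sentinel, and the sample field names are assumptions made for illustration.

```python
REQUIRED = object()  # hypothetical marker for reader fields that have no default


def resolve(written_record, reader_fields):
    """Project a record written with one schema onto a reader's schema.

    reader_fields maps field name -> default value (or REQUIRED).
    """
    resolved = {}
    for name, default in reader_fields.items():
        if name in written_record:
            resolved[name] = written_record[name]  # same-named field: copy it
        elif default is not REQUIRED:
            resolved[name] = default               # missing field: use reader default
        else:
            raise ValueError(f"missing field {name!r} has no default")
    return resolved  # extra fields in the written record are dropped


written = {"user_id": 42, "ip": "10.0.0.1", "legacy_flag": True}
reader_fields = {"user_id": REQUIRED, "ip": REQUIRED, "country": "unknown"}
resolved = resolve(written, reader_fields)
print(resolved)
```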
  
	
  
Data Access, Querying [diagram: Big Data Platform]
Data Access
Motivation
Various data access patterns require data stores beyond just the DFS files. An example is a key value store that needs random access to data.

Solution(s)
There are a number of solutions depending on the use case.
•  Google's BigTable whitepaper
•  SQL has been adapted to Hadoop
  	
  
Data Access: HBase
•  The Hadoop database: a distributed, scalable, big data store (sorted map), from Google's BigTable, backed by Hadoop DFS
•  Linear and modular scalability.
•  Automatic and configurable sharding of tables
•  Automatic failover support
•  Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
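
The "sorted map" access pattern can be sketched with the standard `bisect` module. This is a single-process toy, not the HBase client API; the `SortedMap` class and the row-key layout (`user#0001`, ...) are hypothetical.

```python
import bisect


class SortedMap:
    """Toy sorted map: random access by row key plus ordered range scans."""

    def __init__(self):
        self._keys = []    # row keys kept in sorted order
        self._values = {}  # row key -> value

    def put(self, row_key, value):
        if row_key not in self._values:
            bisect.insort(self._keys, row_key)
        self._values[row_key] = value

    def get(self, row_key):
        return self._values.get(row_key)

    def scan(self, start, stop):
        """Yield (key, value) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


table = SortedMap()
for key, value in [
    ("user#0002", "bob"),
    ("user#0001", "alice"),
    ("video#0001", "intro"),
    ("user#0003", "carol"),
]:
    table.put(key, value)

# "$" sorts immediately after "#", so this range covers every "user#" key.
users = list(table.scan("user#", "user$"))
print(users)
```

Because keys are kept sorted, rows that share a prefix sit next to each other, which is what makes both range scans and range-based splitting of a table practical.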
  
Data Access: SQL – Hive, Impala
•  SQL querying of raw data on the distributed file system
•  Impala: query files on HDFS, including SELECT, JOIN, and aggregate functions, in real time
•  Hive: provides easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems
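
The kind of query meant here (SELECT, JOIN, aggregate) looks like the sketch below. SQLite from Python's standard library is used purely as a local stand-in so the example runs anywhere; it is not Hive or Impala, and the table and column names are invented.

```python
import sqlite3

# In-memory SQLite database standing in for tables defined over files.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER, country TEXT);
    CREATE TABLE page_views (user_id INTEGER, url TEXT);
    INSERT INTO users VALUES (1, 'IN'), (2, 'US');
    INSERT INTO page_views VALUES (1, '/home'), (1, '/search'), (2, '/home');
    """
)

# SELECT + JOIN + aggregate: page views per country.
rows = conn.execute(
    """
    SELECT u.country, COUNT(*) AS views
    FROM page_views v
    JOIN users u ON v.user_id = u.user_id
    GROUP BY u.country
    ORDER BY views DESC
    """
).fetchall()
print(rows)
```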
  
Analytics [diagram: Big Data Platform]
Data Analytics
Motivation
•  Discover the latent value of the data. The core motivation behind Big Data!
•  Clustering, Machine Learning, Correlations, Modeling: the guts of Data Science, often with extremely diverse use cases.

Solution(s)
A pluggable architecture that can share schemas, but allow for a suite of tools appropriate for the use case
  
Data Analytics: Example Frameworks
•  Mahout
    •  Machine learning, clustering
•  Pattern: Machine Learning DSL for Hadoop from Cascading
•  0xData
    •  Open source math and prediction engine for big data
•  Sample Algorithms
    •  Random Forest algorithm
    •  K-Means Clustering
    •  Hierarchical Clustering
    •  Linear Regression
    •  Logistic Regression
    •  Support Vector Machines
    •  Artificial Neural Networks
    •  Association Rule Learning
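
As one concrete instance from the list above, here is a minimal one-dimensional K-Means sketch. It is a toy written for illustration, not the Mahout or 0xData implementation; the function name `kmeans_1d`, the sample points, and the starting centers are assumptions.

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain K-Means on numbers: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
final_centers = kmeans_1d(points, centers=[0.0, 5.0])
print(final_centers)
```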
  
Serving [diagram: Big Data Platform]
Serving
Motivation
•  Powering applications for end users
•  Search/browse and recommendation engines allow real-time access to data
  	
  
Serving: Search – Solr Cloud
•  Builds indexes on top of Hadoop
•  Horizontally scalable, fault tolerant
•  Incredible flexibility in indexing options
    •  Tokenization
    •  Field types
    •  Data storage
•  Search options just as flexible
    •  AND, OR, NOT, wildcard
    •  Facets (counts from a derived ontology)
•  Extensive algorithm and weighting pluggability
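
The indexing and search options above can be sketched with a toy inverted index. This is not Solr's API: the documents, the `type` field, and the helpers `build_index`, `term`, and `facet_counts` are hypothetical, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter, defaultdict

# Hypothetical documents with a text field and a "type" field used for facets.
docs = {
    1: {"text": "hadoop cluster storage", "type": "blog"},
    2: {"text": "hadoop search index", "type": "doc"},
    3: {"text": "search ranking", "type": "blog"},
}


def build_index(documents):
    """Inverted index: token -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for token in doc["text"].lower().split():  # simplistic tokenization
            index[token].add(doc_id)
    return index


def term(index, word):
    """Ids of documents matching a single term (empty set if unseen)."""
    return index.get(word, set())


def facet_counts(documents, doc_ids, field):
    """Facets: counts of each field value among the matching documents."""
    return Counter(documents[i][field] for i in doc_ids)


index = build_index(docs)
both = term(index, "hadoop") & term(index, "search")         # AND
either = term(index, "hadoop") | term(index, "search")       # OR
search_only = term(index, "search") - term(index, "hadoop")  # NOT
facets = facet_counts(docs, either, "type")
print(both, either, search_only, facets)
```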
  
Serving: Manas – Matching Engine
•  The Hive's massively scalable matching engine
•  Handles hundreds of millions to billions of documents efficiently while matching against hundreds to thousands of features
•  Nothing exists today in the Open Source community that has these capabilities
  
EXAMPLE APP USE-CASE
  
App Server Data Flow
SecurityX on App Server

Big Data App Server by Lance Riedel, CTO, The Hive, for The Hive India event
