
Analyzing Multi-Structured Data



Given its ability to analyze structured, unstructured, and "multi-structured" data, Hadoop is an increasingly viable option for analytics and business intelligence within the enterprise. Dramatically more scalable and cost-effective than traditional data warehousing technologies, Hadoop is also increasingly used to perform new kinds of analytics that were previously impossible. When it comes to Big Data, retailers are at the forefront of leveraging large volumes of nuanced information about customers to improve the effectiveness of promotional campaigns, refine pricing models, and lower overall customer acquisition costs. Retailers compete fiercely for consumers' attention, time, and money, and effective use of analytics can result in sustained competitive advantage. Forward-thinking retailers can now take advantage of all data sources to construct a complete picture of a customer. That picture invariably draws on both structured data (customer and inventory records, spreadsheets, etc.) and unstructured data (clickstream logs, email archives, customer feedback and comment fields, etc.). For example, online retailers with structured, transactional sales data can connect that data with unstructured comments from product reviews, gaining insight into how reviews affect consumers' propensity to purchase a particular product. This session will examine several real-world customer use cases applying combined analysis of structured and unstructured data.

Published in: Technology


  1. Applying Big Data Analytics. Analyzing Multi-Structured Data with Hadoop. Justin Borgman, CEO & Co-Founder
  2. Company Profile
     • 30 people, based in Cambridge, MA
     • Founded in July 2010
     • Raised $9.5M Series A from Bessemer and Norwest
     • Based on the HadoopDB research project in the Yale Computer Science Department by Daniel Abadi, et al.
     • CEO & Co-Founder
     • Previously spent 7 years as a software developer at MIT Lincoln Laboratory and product manager at startup Covectra
     • Undergrad: UMass Amherst
     • Grad: Yale University
  3. Big Data: Volume | Variety | Velocity | VALUE Source:
  4. Big Data in the Headlines
     • “How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did”
     • “Digital universe” grew by 62% last year to 800K petabytes & will grow to 1.2 zettabytes this year
     • “Why Netflix produces BBC remake starring Kevin Spacey, directed by David Fincher”
  5. The Big Data Ecosystem
  6. Example: Big Data Analysis Process
     [Diagram. Systems: Hadoop, MPP DBMS. Raw data: web access logs, click logs, impressions, email, tweets, sensor data, documents. Flow labels: load, extract, predict. Processing steps: term extraction, entity extraction, sentiment analysis, geocoding, cleanse, sessionization, join, aggregate, sample, filter. Consumers: applications, BI tools, predictive analytics; business analyst.]
  7. Example: Hadapt Analysis Process
     [Diagram. Raw data. Flow labels: load, predict. Consumers: applications, BI tools, predictive analytics.]
  8. The Evolution of Analytics – Where are we today?
     The early stages of analytics
     • Market Basket Analysis
     • Trend Analysis
     • Cyclical Analysis
     • Customer Segmentation
     New Analytical Models
     • Pattern Detection, Discovery, Matching
     • A/B Testing and Behavioral Analysis
     • Sessionization
     • Social Correlation Analysis
     • Fractional Attribution
     • Sentiment Analysis
     • Personalization
  9. Big Data in Action
     • Amazon and Netflix engage in arbitrage on video content based on customer behavior
     • Harvard predicts the spread of cholera in Haiti, and Derwent Capital out-trades the market based on tweets and their sentiment
     • Entire ecosystems were shotgun gene sequenced by Celera.
     • Life events are predicted by Target and marketed accordingly
     • *Osco Drug increased sales by optimizing product placement, e.g. beer and diapers
     • Ads are optimally placed and priced *for you* by DataXu in real time
     • Next Big Sound predicts new artists and hits based on signals from social media
     • Real-time production optimization saves Chevron over $1B/year
     • Retailer web sites are re-organized and re-optimized for content by Bloomreach
     • LinkedIn suggests who you might know, eHarmony suggests who you might love
  10. Example: POS Data Insights
  11. Example: e-Tailer
      Business Opportunity
      • Should I run a promotion among the Lady Gaga fans or Justin Bieber fans?
      • Based on shopping cart and browsing/purchase history, what other products should be recommended before the customer checks out?
      • Which items are often purchased together, and any correlation with shopping date/time, customer age, gender, etc.?
      Challenges
      • Diverse data sources
      • In-depth analytics (e.g. predictive modeling)
      • Real time performance at scale
      Solution
      – Integrate Hadoop with RDBMS
      – Develop and integrate analytic libraries
      – Make analytic jobs interactive (not batch oriented)
  12. Example: Customer Behavior Analysis
      Business Opportunity
      • Analyze customer behavior to increase loyalty and trust, allocate advertising spend, optimize product incentives, identify fraud, micro-segment customer base.
      Challenges
      • Full website session-level data needed, typically from raw web logs
      • Requires complex multi-pass SQL queries or new non-SQL techniques
      • Requires rewriting query to change number of clicks analyzed
      Hadapt Value
      • Performance: Single pass over data regardless of number of clicks analyzed
      • Ease of Dev & Ease of Manageability: Much simpler code
      • Ease of Use: Pattern flexibility to handle varied numbers of clicks and click patterns without requiring any code rewrite
      Golden Path Analysis: Comparative Performance
      • ETL + RDBMS & SQL = 200 minutes
      • Hadoop + RDBMS = 135 minutes
      • Hadapt = 11 minutes
      Example Analytic Questions
      • Which life events are strong opportunities for me to better engage my customers?
      • When am I about to lose a customer?
      • What are my top segments?
      • Which ad campaigns produced the most lift?
      • What products can I bundle to increase sales?
      • Are my online offers cannibalizing my in-store sales?
      • What models are my customers following so I can better predict their next move?
  13. Example: Social Media Analysis
      Business Opportunity
      • Identify influencers based not only on # of followers and re-tweets, but also messaging content and sentiment in reply/re-tweets
      • Aggregate individual sentiments by incorporating tweet authors’ influence scores
      • What phrases or product defects do customers often mention before they attrite?
      Challenges
      • Ingest and analyze high speed incoming events
      • High quality sentiment output (NLP + Big Data)
      • Insights generated across data sets
      Solution
      – Enhance Hadoop with better interactivity
      – Integrate NLP packages to Big Data platform
      – Ingest, analyze, and store all datasets in one platform
  14. Example: Text Analysis & e-Discovery
      Business Goal
      • Archive ALL electronic documents – email, Office, PDF, instant messages, etc. – in a reference archive, retaining original document formats. Provide rapid, flexible access and extraction capabilities for eDiscovery and compliance measures.
      Challenges
      • Massive scale of documents in multiple formats and structures.
      • Sophisticated query and analysis requirements.
      • Future formats impossible to predict.
      • Must retain original document format.
      Hadapt Value
      • Cost-effective: scale to 100s of TB and PB of original document storage.
      • Flexible query access: use SQL, Full Text Search, or combine SQL+Search.
      • Preventative analysis: apply deduplication, sentiment analysis, categorization to accelerate document assessment.
      Building the Archive: Scalability and Cost Issues
      • Teradata/Netezza: $50K–100K/TB
      • Search engine: $100K/TB
      • Integration costs: $150K
      • Total: $200K/TB + $150K
      Example Analytic Questions
      • Retrieve all emails and instant messages from all employees in Denver office between 1995 and 1998
      • Who are the top 10 recipients of emails from Bob Smith
  15. Hadapt – Key Considerations
      Simplicity
      • All-in-one system for “multi-structured” data analytics
      • Single cluster for analysis of multiple data types – low TCO, high performance
      • Analyze relational & unstructured data together to answer new questions
      • Eliminate data movement between Hadoop and RDBMS
      • Use SQL + Full Text Search – a fully integrated solution
      Accessibility
      • Leverage existing investment in SQL tools and skills
      • Can roll out Hadapt analytics to existing BI tool users
      • Makes Hadoop easier to adopt for SQL-heavy enterprises
      Scalability / Performance
      • Enormous performance boost for multi-structured data analysis
      • Adaptive query planning provides on-the-fly load balancing & fault tolerance
      • Ad-hoc and interactive querying of massive data sets
  16. QUESTIONS?
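
The customer behavior slide (slide 12) describes sessionization and "golden path" analysis over raw web logs, with the number of clicks analyzed changeable without a rewrite. A minimal in-memory sketch of that idea in Python follows. It is not Hadapt's implementation and says nothing about how the timings on the slide were obtained. The event layout `(user, timestamp, page)`, the 30-minute inactivity threshold, the sample data, and the function names `sessionize` and `top_paths` are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

# Assumed inactivity threshold that ends a session; not taken from the slides.
SESSION_GAP = timedelta(minutes=30)


def sessionize(events, gap=SESSION_GAP):
    """Group (user, timestamp, page) events into per-user sessions.

    Events are sorted by user and time. A new session starts when the user
    changes or when two consecutive clicks are more than `gap` apart.
    """
    sessions = []
    last_user, last_time = None, None
    for user, ts, page in sorted(events, key=lambda e: (e[0], e[1])):
        if user != last_user or ts - last_time > gap:
            sessions.append((user, []))
        sessions[-1][1].append(page)
        last_user, last_time = user, ts
    return sessions


def top_paths(sessions, n_clicks, k=3):
    """Return the k most common click paths of length n_clicks.

    The path length is a parameter, so analyzing a different number of
    clicks needs no change to the code.
    """
    counts = Counter()
    for _, pages in sessions:
        for i in range(len(pages) - n_clicks + 1):
            counts[tuple(pages[i:i + n_clicks])] += 1
    return counts.most_common(k)


def t(minute):
    """Hypothetical helper: a timestamp `minute` minutes after a fixed start."""
    return datetime(2012, 1, 1, 9, 0) + timedelta(minutes=minute)


# Made-up click log for illustration only.
events = [
    ("alice", t(0), "home"), ("alice", t(2), "search"),
    ("alice", t(5), "product"), ("alice", t(7), "cart"),
    ("bob", t(1), "home"), ("bob", t(3), "search"), ("bob", t(4), "product"),
    ("alice", t(120), "home"), ("alice", t(121), "product"),
]

sessions = sessionize(events)
print(top_paths(sessions, n_clicks=3, k=1))
```

Calling `top_paths(sessions, n_clicks=2)` or `n_clicks=4` reuses the same sessions, which is the flexibility the slide contrasts with rewriting a multi-pass SQL query. A production version would run the same grouping and counting as a distributed job rather than in memory.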