Big Data Beyond the Hype, May 2014


Ron Bodkin of Think Big Analytics talks about trends in the Big Data space.

  1. 1. Big Data Beyond the Hype: Trends, Pitfalls and Building Competency, May 2014 (5/29/14)
  2. 2. Agenda
     • Why Big Data
     • Technology Capabilities
     • Organizational Evolution
     • Case Study
     • Conclusions
  3. 3. Big Data
     • Data sets so large and complex that they become awkward to work with using standard tools and techniques
     • Google, Quantcast, Yahoo, LinkedIn… customer innovation in open source
  4. 4. Patterns from Delivering Big Data at Scale
     • eCommerce: 2 of Global Top 5
     • Retail: 2 of Global Top 5
     • Social Networking: Global #1
     • Banking: 4 of Global Top 10
     • Credit Issuer: 2 of Global Top 5
     • Financial Data Services: 2 of Global Top 5
     • Financial Exchanges: Global #2
     • Brokerage & Mutual Funds: 2 of Global Top 5
     • Asset Management: Global #1
     • Semiconductor: 2 of Global Top 5
     • Data Storage Devices: 3 of Global Top 5
     • Disk Drive Manufacturing: Global #1
     • Telecommunications: 2 of Global Top 5
     • Media & Advertising: 2 of Global Top 4
     • Internet Transaction Security: Global #1
  5. 5. Manufacturing Analytics
     1. Improve Yield - reduced scrap, faster time to market
     2. Ease Data Access - quick search access to all relevant data for all history in one place, not samples, summary or recent data
     3. Process Efficiency - identify redundant tests, optimize throughput
     4. Component Analysis - proactive discovery of component problems, compatibility detection
     5. Service Revenue Generation - Leverage customer support to generate new revenues by increasing customer retention, support renewals, and upgrading customers to new products and licenses.
  6. 6. Product Analytics
     1. Issue Resolution - Faster resolution of issues through deep analysis of integrated device log, product, and configuration data across the install base.
     2. Support Metrics - Gain new insights into dimensions impacting service metrics by combining product use technical data with business data. E.g., number of tickets resolved, length of calls, resource allocation, customer satisfaction.
     3. Proactive Service - Identify and address potential issues rather than waiting to react to them after they happen. I.e., intelligence for condition-based maintenance.
     4. Customer Self-Service - Drive higher rates of customer self-service by providing customers access to tools and resources to answer their own questions and solve their own issues by analyzing past customer activity and integrating technical details from customer systems into self-service portals.
     5. Service Revenue Generation - Leverage customer support to generate new revenues by increasing customer retention, support renewals, and upgrading customers to new products and licenses.
  7. 7. Clickstream Analytics
     1. Timely Processing - Growing data volumes clog the data warehouse, leaving it slow and out of date.
     2. Cost Efficiency & Scale - Growing data volumes clog the data warehouse, leaving it slow and out of date.
     3. Flexible Tagging - Analyze the site without having to push data to tags; classify activity based on sections and URLs more flexibly than tags allow.
     4. Converged Analytics - Understand and improve web, mobile, cross-channel customer activity. Siloed apps are costly and inflexible, e.g., tags require feeding data about user demographics and segments. Mash up interactions with data like geographic, demographic, social, call center, purchases.
     5. Deeper Exploration - Analyze sequences of events in customer journeys. Deep drill down (how do users with the latest iPhone and iOS compare to the latest Samsung and Android). Cohort analysis - those who buy books vs sporting goods.
     6. Predictive Models - next best offer recommendations, personalization, customer value, microsegmentation, ad attribution…
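To make the "sequences of events" and "cohort analysis" items more concrete, here is a minimal sketch in plain Python. The event tuples, field layout, and function names are all hypothetical and are not taken from the talk; a real clickstream pipeline would run this kind of logic on a cluster rather than in memory.

```python
from collections import defaultdict

# Hypothetical sample events: (user_id, timestamp, event_type, category).
EVENTS = [
    ("u1", 1, "view", "books"),
    ("u1", 2, "purchase", "books"),
    ("u2", 1, "view", "sporting_goods"),
    ("u2", 3, "view", "sporting_goods"),
    ("u2", 4, "purchase", "sporting_goods"),
    ("u3", 2, "view", "books"),
]


def journeys(events):
    """Group events per user, ordered by timestamp, to form customer journeys."""
    by_user = defaultdict(list)
    for user, _ts, kind, category in sorted(events, key=lambda e: (e[0], e[1])):
        by_user[user].append((kind, category))
    return dict(by_user)


def cohorts_by_purchase(journeys_by_user):
    """Assign users to cohorts named after the categories they purchased."""
    cohorts = defaultdict(set)
    for user, steps in journeys_by_user.items():
        for kind, category in steps:
            if kind == "purchase":
                cohorts[category].add(user)
    return dict(cohorts)


def mean_views_before_purchase(journeys_by_user, users):
    """Average number of 'view' events each user had before the first purchase."""
    if not users:
        raise ValueError("cohort is empty")
    counts = []
    for user in users:
        views = 0
        for kind, _category in journeys_by_user[user]:
            if kind == "purchase":
                break
            if kind == "view":
                views += 1
        counts.append(views)
    return sum(counts) / len(counts)


j = journeys(EVENTS)
c = cohorts_by_purchase(j)
books_avg = mean_views_before_purchase(j, c["books"])
sports_avg = mean_views_before_purchase(j, c["sporting_goods"])
```

Comparing `books_avg` with `sports_avg` is the kind of cohort question the slide describes (book buyers vs sporting goods buyers); users who never purchase, such as `u3` here, fall into no cohort.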
  8. 8. Agenda
     • Why Big Data
     • Technology Capabilities
     • Organizational Evolution
     • Case Study
     • Conclusions
  9. 9. Big Data Reference Architecture
  10. 10. Apache Hadoop
     • Community open source
     • Over $1 billion invested in startups; all major software companies embrace it
     • 3 replicas for HA, 50 PB+ scale
     [Diagram: cluster layout. Client servers: Hive, Pig, ...; cron+bash, Azkaban, …; Sqoop, Scribe, …; monitoring, management. Primary master server: Job Tracker, Name Node. Secondary master server: Secondary Name Node. Cluster slave servers: Task Tracker and Data Node, each with multiple disks.]
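The "3 replicas" figure has a direct cost in raw disk, which the following back-of-the-envelope sketch illustrates. It assumes the 50 PB figure refers to logical (pre-replication) data, which the slide does not state, and it ignores compression and operating overhead; the function names are illustrative only.

```python
def raw_capacity_pb(logical_pb: float, replication: int = 3) -> float:
    """Raw disk (PB) needed when every block is stored `replication` times."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return logical_pb * replication


def usable_capacity_pb(raw_pb: float, replication: int = 3) -> float:
    """Logical data (PB) that fits on `raw_pb` of disk at this replication factor."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return raw_pb / replication


def tolerated_replica_losses(replication: int = 3) -> int:
    """How many copies of a block can be lost while one copy still survives."""
    return replication - 1


# Assumption: 50 PB is the logical data size, not the raw disk size.
raw_needed = raw_capacity_pb(50)
```

Under that assumption, 50 PB of data at three replicas needs about 150 PB of raw disk, and any single block can lose two of its copies before data is lost.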
  11. 11. YARN - Yet Another Resource Negotiator
     • Support for many processing engines, including emerging predictive analytics libraries
     • Common cluster with local data access in HDFS
     Copyright Think Big Analytics 2014. All rights reserved.
  12. 12. Narrowing In
     • Focus on technologies for prescriptive analytics
       - Getting insights from your data
     • To get there you need
       - data ingestion (Flume, Kafka, Sqoop, ...)
       - data transformation (Pig, MapReduce, Hive, Cascading, …)
     • Beyond prescriptive, machine-learned predictive analytics is interesting
  13. 13. Apache Spark
     • Scala-based, more general distributed processing with long-lived in-memory caching
     • First released 2010, many contributors
     • Scala, Java, Python APIs
     • Largest production cluster to date is 80 nodes at Yahoo
     • YARN support
     • Subprojects
       - Shark - Hive on Spark
       - MLlib - machine learning
       - Spark Streaming
       - GraphX
     • Pros: faster iterative analysis, growing distribution support
     • Cons: less mature, Scala API foundation, smaller community
  14. 14. Apache Storm
     • DAG processing of never-ending streams of data
       - First released 2011
       - Used at Twitter plus dozens of other companies
       - Reliable - at-least-once semantics
       - Think MapReduce for data streams
       - Java / Clojure based
       - Bolts in Java and 'Shell Bolts'
       - Not a queue, but usually reads from a queue
       - YARN integration
     • Pros: community, HW support, low latency, high throughput
     • Cons: static topologies & cluster sizing, Nimbus is a SPOF, state management
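The "at-least-once semantics" bullet is worth a small illustration, because it shapes how downstream code must be written. The sketch below is a plain-Python simulation and does not use the Storm API: `run_at_least_once`, `IdempotentCounter`, and `flaky_bolt` are hypothetical names. It shows a message being replayed after a lost acknowledgement, and a sink that tolerates the duplicate by remembering message ids.

```python
from collections import deque


def run_at_least_once(messages, process, max_attempts=5):
    """Deliver each (msg_id, payload) until `process` acknowledges it.

    Returns the list of message ids in delivery order, duplicates included.
    """
    pending = deque((msg_id, payload, 1) for msg_id, payload in messages)
    deliveries = []
    while pending:
        msg_id, payload, attempt = pending.popleft()
        deliveries.append(msg_id)
        acked = process(msg_id, payload, attempt)
        if not acked:
            if attempt >= max_attempts:
                raise RuntimeError(f"message {msg_id!r} never acknowledged")
            # No ack seen: replay later, so the message may be processed twice.
            pending.append((msg_id, payload, attempt + 1))
    return deliveries


class IdempotentCounter:
    """Sink that applies each message id once, so replays do not double count."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def apply(self, msg_id, value):
        if msg_id in self.seen:
            return
        self.seen.add(msg_id)
        self.total += value


sink = IdempotentCounter()


def flaky_bolt(msg_id, payload, attempt):
    """Apply the side effect, then simulate a lost ack for 'b' on attempt 1."""
    sink.apply(msg_id, payload)
    return not (msg_id == "b" and attempt == 1)


deliveries = run_at_least_once([("a", 1), ("b", 2), ("c", 3)], flaky_bolt)
```

Message `b` is delivered twice, yet `sink.total` counts it once. Without the `seen` check the total would be inflated, which is the practical consequence of at-least-once (as opposed to exactly-once) delivery.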
  15. 15. Platform Trends
     • Hadoop will eat Big Data Analytics
     • Hadoop YARN to run diverse applications
     • Realtime latency: Storm, Samza, HBase, Cassandra, Elasticsearch…
     • Management: Ambari, Pepperdata…
     • Security: Knox, Sentry, XA Secure…
     • Cloud: AWS, Qubole, Altiscale, Infochimps…
  16. 16. Analytics Trends
     • Realtime query engines: Hive/Tez, Shark, Impala, Presto…
     • Machine Learning: Spark, MLlib, Mahout, Oryx…
     • Metadata: HCatalog
     • Analyst Tools: Tableau, Alpine Data Labs, Platfora, Zoomdata…
     • Sensemaking: Data Tamer, Trifacta, Paxata…
  17. 17. Agenda
     • Why Big Data
     • Technology Capabilities
     • Organizational Evolution
     • Case Study
     • Conclusions
  18. 18. How To Get the Right Answers
     • Current State: weak intersection between Data, Systems and Business often means project failure
     • Improved State: collaborative integration of Analytics with Big Data capabilities for success
     [Diagram labels: Data; weak area with diminished influence; cross-functional collaboration; data-driven business alignment]
  19. 19. Analytics and Data Science
     Analytics is the umbrella term for discovery and communication of signal in data.
     • Intelligence and Reporting: Governed insights for general audiences by providing summaries and visualizations over well understood data
     • Descriptive Analytics: Interpretable data stories to domain consumers (e.g. business, sales, etc.) using exploratory data analysis and descriptive models (e.g. clustering) over well understood data
     • Predictive Analytics: Predicts future state or assesses impact of changing parameters for domain consumers (e.g. business, sales, etc.) using statistical modeling over well understood data
     • Data Science: Uncovers new or missing signals and relationships in data to be leveraged by other analytics. Supported with a mix of exploratory data analysis and statistical modeling over unexplored data or new combinations of known data. New capability driven by Big Data, currently overhyped and overused.
     End goal: effective and repeatable integration of analytics into business processes.
  20. 20. Analytics Capability Model
     [Diagram: stages plotted on two axes. Axes: Analytic Integration (Low to High); Big Data Capability (Limited to Full). Stages shown: Basic Reporting, Big Data Reporting, Business Analytics, Data Science Hub, Data Driven Organization.]
  21. 21. Client Example – Internet Infrastructure Firm
     • Redundant infrastructure
     • Data locked down within product teams
     • Analytic innovations not shared
     • Share data internally
     • Support joining of data
     • Make analytics capabilities accessible to all business lines
     [Diagram: Data Science Hub and Data Driven Organization plotted on Analytic Integration (Low to High) vs Big Data Capability (Limited to Full)]
  22. 22. Agenda
     • Why Big Data
     • Technology Capabilities
     • Organizational Evolution
     • Case Study
     • Conclusions
  23. 23. Use Cases
     • "Reduce Data Search Parties"
     • Stop playing "Where's Waldo" with your data
     • "I know I have that data... somewhere?"
     • Data aggregation to a common platform with common access tools
     • Improve yields by accessing more data in a more timely manner
     • End-to-end visibility for every test, every diagnostic and all info from all components of a product (internal and external)
     • Speed up yield improvement ramp-up on new products
     • Improve steady-state yield on existing products
  24. 24. Big Data Platform
     [Diagram: data sources feed the Big Data Platform, which serves data consumers]
     • Data Sources: Devices, Components, Assemblies, …; Field Data; Supplier; ERP/DW's; all raw parametric, logistic, vintage data
     • Platform: raw extracts; enriched data; End-to-End Integrated Data; Batch Analytics (parallelized batch analytics); App-Specific Views; New High-Value Parameters
     • Consumers: Ad hoc Analysis; BI Tools; Optimize/Reduce Testing; Data Marts; Proactive Issue Identification; Failure Screen Tests; Customer FA via Field Data; Predictive Analytics Tools; …other
  25. 25. Highlights
     • Enterprise-wide data access for timely analytics and insights
     • Foundation for large scale proactive analytics
     Legacy vs BDP:
     • Retention. Legacy: 3-6 months in scattered DBs, then tape archive. BDP: all data online in a common data lake for 3+ years.
     • Coverage. Legacy: summaries and samples for several data sets. BDP: all parametric data captured in raw and integrated form.
     • Analysis. Legacy: reactive, resulting in missed improvement opportunities. BDP: in daily operations and on larger data sets to support proactive improvements.
  26. 26. Key Metrics & Highlights
     • Metrics:
       • Collecting >2M manufacturing/testing binary files daily
       • Collecting from ~500 tables across 6 databases → tens of millions of records daily
       • Over 140 users to date in early piloting
       • Over 150 attendees participated in BDP training
     • Highlights: although early in the overall journey, the BDP is already demonstrating early benefits:
       • Development Engineer: demonstrated the joining of data sets for detailed logistics tracking, an analysis that is very difficult to conduct with current systems
       • Ops Engineer: a recent production issue required detailed historical data. Current systems did not have the required retention for this data. However, the team was able to pull the data from the BDP in minutes, as opposed to 3+ weeks to pull the data from tape archive
       • Development Engineer: obtained technical data from the BDP in hours, as opposed to 3+ weeks to pull from tape archive
     [Diagram labels: DATA SEARCH PARTIES; YIELD]
  27. 27. Governance – Project Prioritization
     [Diagram: candidate projects flow through nomination, prioritization and sponsorship into a prioritized list]
     • Candidates: New Data Sets; Use Cases from Strategy; New Apps / Reports; New Analytics; Core Platform Updates
     • Roles: Business Experts; Working Steering Committee; Executive Steering Committee
     • Stages: Nominated by…; Prioritized by…; Sponsored by…
     • Criteria: business impact vs. ease of delivery, etc.
     • Prioritized List: 1. New Data Set 1; 2. New Use Case 1; 3. Use Case from Strategy
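One way to picture the "business impact vs. ease of delivery" criterion is a simple scoring pass over the nominated candidates. The sketch below is illustrative only: the 1 to 5 scales, the multiplicative score, and the specific scores assigned to each candidate are assumptions made for the example, not something the slide specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    impact: int  # hypothetical 1-5 business impact score
    ease: int    # hypothetical 1-5 ease-of-delivery score

    @property
    def score(self) -> int:
        # Assumed scoring rule: impact times ease, so a candidate
        # needs both to rank highly.
        return self.impact * self.ease


def prioritize(candidates):
    """Rank candidates by descending score, breaking ties by name."""
    return sorted(candidates, key=lambda c: (-c.score, c.name))


# Candidate names come from the slide; the scores are invented.
nominated = [
    Candidate("Use Case from Strategy", impact=5, ease=2),
    Candidate("New Use Case 1", impact=5, ease=3),
    Candidate("New Data Set 1", impact=4, ease=5),
]

ranked = [c.name for c in prioritize(nominated)]
```

With these made-up scores the ranking happens to match the slide's prioritized list. A real committee would weigh more factors (the slide's "etc."), so the product of two scores should be read as a starting point for discussion.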
  28. 28. Release Cycles
     Big Data Platform Roadmap: delivering new capabilities to the business every 45 – 90 days
     [Diagram: monthly timeline with a "Today" marker and releases laid out across tracks]
     • Releases: Release 1a: base platform, first product; Release 1b: all active products; Release 2: Field/RMA Data; Release 3; Release 4; Release 5
     • Tracks: Quality; Development; Operations; Core/Other
     • Capabilities: Reduce Scrap; New Product Data Access; New Data Sets; Ongoing Support; Faster Response to Field Requests; Correlated Analysis; Improved Failure Analyses
  29. 29. Agenda
     • Why Big Data
     • Technology Capabilities
     • Organizational Evolution
     • Case Study
     • Conclusions
  30. 30. Stages of Big Data Adoption
     [Diagram: adoption stages leading to sustained competitive advantage through Big Analytics]
     • Stages: Traditional B.I.; Big Data Reporting; Data Science Hub; Data Driven Organization
     • Other labels: Contain Costs & Scale; Agile Analytics & Insight; Innovation; 360° view; Speed; Accuracy; Business Innovation; Optimization; Engagement; Data Science; Automation; Organizational Transformation; New products; Customer analytics; Operational efficiency; New businesses
  31. 31. Pitfalls
     • Data Politics
     • Immature Governance
     • Siloed Organization
     • Siloed Applications
     • Buy-in
     • Leaping over phases / Bypassing low hanging fruit
     • Disruptive Change
     • Insufficient funding
     • Big Bang Rollout
     • Skills Gap
     • Tactical use cases / the big data plateau
     • Business as usual
  32. 32. Conclusions
     • Big Data is affecting all companies, especially in tech
     • Start with low hanging fruit
     • Establish a roadmap for incremental organizational evolution and technical capabilities
