Transcript

  • 1. Advanced Analytics Part I: Use All Your Data
      DC: Rob Morrow, Senior Systems Engineer
      MD: Chris Bove, Senior Systems Engineer
      August 6
  • 2. From BI to Advanced Analytics
      What happened, where, and when?
      What will happen?
      How and why did it happen?
      How can we do better?
      Chart labels: Time; Data Size; Facts; Interpretations
  • 3. Traditional Analytics Process
      Diagram labels: Operationalize Model; In-Database Model Scoring; Data Cleansing & Processing; Data Extraction; Data Exploration & Discovery; In-Memory Model Development
      Axis: Time-to-Insight
  • 4. Accessing & Sharing the Data is Difficult
      Diagram labels: DW; External; Multi-structured; Structured
  • 5. “Are we there yet?”
      1. Find the data
      2. Get access to data
      3. Learn about the data
      4. Move sample data to ADW
      5. Analysis (Finally!)
      6. Operationalize the model
      Data Discovery: 6-9 Months
  • 6. Silo’d Platforms Challenge Collaboration
      Diagram labels: Data Sources; Departmental Warehouse (twice); Silo’d Analytics (twice); Non-Agile Models; Enterprise Apps; Reporting; Prioritized Operational Processes
      Static schemas accrete over time
  • 7. Ouch!
      1. Find the data
      2. Get access to data
      3. Learn about the data
      4. Move to ADW
      5. Analysis (Finally!)
      6. Operationalize the model
      6-9 Months
      Users & Business Influencers (Data Scientist, Business Analysts): “I’m sick of waiting for my data, I’m going to make my own copy.”
      Technical Influencers (DBA/DW Admins): “I need to get those data scientists the data they want, or else they will stand up another data mart, which I will have to manage sooner or later.”
      Executives (Executive Sponsors, LOB Manager (PM, Director, R&D, etc.)): “We don’t have the information we need to answer key business questions.”
  • 8. Solution: Cloudera EDH
      Unified Scale-out Storage For Any Type of Data: Elastic, Fault-tolerant, Self-healing, In-memory capabilities
      Platform layers: Resource Management; Batch Processing; Analytic MPP DBMS; Search Engine; Online NoSQL DBMS; Stream Processing; Machine Learning; SQL; Streaming; File System (NFS); Data Management; System Management; Metadata, Security, Audit, Lineage; Training & Services
      Callouts: Faster data discovery; Multiple tools on one platform; Use all data with centralized mgmt & security; Operationalize Models
      Components named in the callouts: Search; Navigator; Impala; Spark; Hadoop MapReduce; Metadata, Security; Cloudera Manager; Training & Services; Flume / Spark Streaming; HBase
  • 9. Solution: Cloudera EDH (Enterprise Data Hub)
      Unified Scale-out Storage For Any Type of Data: Elastic, Fault-tolerant, Self-healing, In-memory capabilities
      Platform layers: Resource Management; Batch Processing; Analytic MPP DBMS; Search Engine; Online NoSQL DBMS; Stream Processing; Machine Learning; SQL; Streaming; File System (NFS); Data Management; System Management; Metadata, Security, Audit, Lineage; Training & Services
  • 10. Analytics with EDH
      Process labels: Operationalize Model; In-Database Model Scoring; Data Cleansing & Processing; Data Exploration & Discovery; In-Memory Model Development; Time-to-Insight
      Process labels: Data Exploration & Discovery; Data Cleansing & Processing; Operationalize Model; Data Extraction; In-Platform Model Dev & Scoring
      Deliver Insight Sooner
  • 11. Solution Benefits
      Use all your data
        • Use 100x more data, and more types of data, with existing tools
        • Reduce sampling and increase model accuracy and precision
        • Centralize information security, metadata, management, and governance
      Shorten analytics lifecycle
        • Compress the cycle time from data to insights
        • Facilitate data discovery with real-time SQL and Search
        • Track data life-cycle in place
        • Define, test, deploy, and update models all within the EDH
      Do more with data
        • Deliver multi-genre analytics in a single platform
        • Apply diverse concurrent analytics to your full datasets in-place
        • Protect existing technology and skillset investments
  • 12. Business Value Delivered
      Buyers: Users & Business Influencers; Technical Influencers; Executives
      Data Scientist, Business Analysts: “I’m sick of waiting for my data, I’m going to make my own copy.”
        • Acquire data necessary for projects
        • Develop analysis/models with better fit faster
        • Share data sets to empower others
      DBA/DW Admins: “I need to get those data scientists the data they want, or else they will stand up another data mart, which I will have to manage sooner or later.”
        • Spend less time and money reconciling shadow IT environments
        • Shared security, metadata, management, and governance
      Executive Sponsors, LOB Manager (Marketing, Sales, R&D, etc.): “We don’t have the information we need to answer key business questions.”
        • Acquire necessary information sooner to make critical business decisions
  • 13. Data Access: Stores and Connectors
      Data formats: Thrift; pdf/Word/txt; csv; Parquet; Sequence; JSON; Binary; Avro
      Stores, connectors, and tools: ORACLE; NETEZZA; ODBC/JDBC; TERADATA; MongoDB; Splunk/Hunk; MICROSTRATEGY; IMPALA; HBASE; SOLR; SPARK; ACCUMULO; ZoomData; Hive; Sqoop; Flume; Partner Native Connectors; Revolution R; SkyTree
  • 14. Historical Archive: Tape vs Data
      Aggregate Data Value
        • Direct access to data has value; data stored offsite/offline has cost
        • A single 8k record may have nearly zero value, but 10,000? 10,000,000?
        • What is the business value of testing the predictive power of current data?
      Data Availability
        • Assuming locality, is 110MB per drive fast enough?
        • Certainly not fast enough to be included in any current analytics.
        • Striping across tape drives is science fiction. Complex tiers, anyone?
        • I/O IS the problem. Not CPU.
      Data Volume/Cost
        • Everyone practices backups. How about restores? Full site restores?
        • Can’t we just more aggressively compress online data?
        • “Tape is cheap”. It had better be, because the data isn’t easily usable.
  • 15. Spark Streaming: What is it?
      Spark Streaming processes data in micro-batches: Resilient Distributed Datasets (RDDs)
      Consistent with HDFS architectural principles
      Processing individual records (AKA Storm) creates inconsistencies (simultaneous writes).
  • 16. What can you do with it?: Stream It
      Streaming “Windows” allow time-sliced atomic updates to analytics
      Discretized Stream (DStream): sequence of RDDs arranged as lines/words
      Window: sequence of DStreams time-arranged as windows
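
The windowing idea on this slide can be sketched in plain Python. This is not the Spark Streaming API: each micro-batch is just a list of words standing in for one RDD, and a window sums the per-batch counts of the most recent batches. The function name `windowed_counts` and the parameters `window_length` and `slide_interval` (both counted in batches) are illustrative assumptions.

```python
from collections import Counter, deque


def windowed_counts(batches, window_length, slide_interval):
    """Yield (batch_index, word_counts) once every `slide_interval` batches.

    batches: iterable of word lists, one list per micro-batch.
    window_length: how many of the most recent batches a window covers.
    """
    window = deque(maxlen=window_length)  # older batches fall out automatically
    for index, batch in enumerate(batches, start=1):
        window.append(Counter(batch))  # per-batch counts, like one small RDD
        if index % slide_interval == 0:
            total = Counter()
            for counts in window:
                total.update(counts)  # Counter.update adds counts together
            yield index, total


# Hypothetical stream of four micro-batches.
stream = [["a", "b"], ["a"], ["c", "a"], ["b"]]
results = list(windowed_counts(stream, window_length=3, slide_interval=2))
for index, counts in results:
    print(index, dict(counts))
```

Each emitted result is computed from whole batches only, which is the sense in which the update is time-sliced: a reader never sees a window that contains half of a batch.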
  • 17. What can you do with it?: ML
      “What about Machine Learning?”
      Spark-ML: Same input format and algorithms as Mahout.
      Uses Resilient Distributed Datasets in-memory
      Useful for:
        Clustering (k-Means, etc.)
        Classification (email, sentiment)
        Recommenders (ratings correlation)
        Dimensionality Reduction (PCA, SVD)
  • 18. Model Effectiveness and Sampling
        • Some statisticians (medical) find it hard to turn the corner on the sampling topic:
        • ANOVA vs Multiple Regression. Same tests**, one’s a vector without the Power problems
        • Algorithm choice should be related to, not restricted by, data volume.
        • Best approach = simple algorithm, lots of data
        • Sampling should still be used, but to test model effectiveness. Not to fix IT.
      **Source: Applied Multiple Regression/Correlation Analysis (Cohen & Cohen, 1983)
  • 19. Which dataset offers better predictive power? Remember, this is not testing for an effect…
      First table (sparse; the column position of each value was lost in extraction):
        Uses work computer for shopping: 1, 4, 5, 1
        Moves data between networks: 4, 5, 2
        Works long hours: 4, 3, 3
        System/Network admin privs: 5
      Second table (complete):
                                         Alice   Bob Chuck Donna Eddie Frank  Gina
        Uses work computer for shopping      1     4     2     4     5     1     3
        Moves data between networks          4     3     1     5     1     4     3
        Works long hours                     2     4     3     3     4     3     2
        System/Network admin privs           1     2     1     5     3     5     4
      (The slide labels the two tables 1 and 2.)
      1. As we add dimensions, average distance increases. Add Data.
      2. Fewer “neighbors” within a certain radius of any given point when the dataset is smaller. Add Data.
      3. Are you looking at similarity (r/cosine) or are you using dissimilarity (Euclidean)?
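
Points 1 and 3 can be checked on the complete table from this slide. The following is a minimal sketch in plain Python; the helper names (`euclidean`, `cosine`, `mean_pairwise_distance`, `nearest`) are illustrative and do not come from the deck.

```python
import math
from itertools import combinations

# Rows of the complete table, one vector per person:
# (shopping, moves data, long hours, admin privs)
RATINGS = {
    "Alice": [1, 4, 2, 1],
    "Bob": [4, 3, 4, 2],
    "Chuck": [2, 1, 3, 1],
    "Donna": [4, 5, 3, 5],
    "Eddie": [5, 1, 4, 3],
    "Frank": [1, 4, 3, 5],
    "Gina": [3, 3, 2, 4],
}


def euclidean(u, v):
    """Dissimilarity: straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    """Similarity: cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_distance(vectors, dims):
    """Average Euclidean distance using only the first `dims` features."""
    pairs = list(combinations(vectors, 2))
    return sum(euclidean(u[:dims], v[:dims]) for u, v in pairs) / len(pairs)


def nearest(name, score, pick):
    """Best match for `name`; pick=min for distances, pick=max for similarities."""
    others = [n for n in RATINGS if n != name]
    return pick(others, key=lambda n: score(RATINGS[name], RATINGS[n]))


# Point 1: the average distance never shrinks as dimensions are added.
averages = [mean_pairwise_distance(list(RATINGS.values()), d) for d in range(1, 5)]
print([round(a, 2) for a in averages])

# Point 3: the choice of measure can change who counts as the closest neighbor.
print(nearest("Alice", euclidean, min), nearest("Alice", cosine, max))
```

On this table the two measures disagree for Alice: Euclidean distance picks Chuck, while cosine similarity picks Donna, which is why the slide asks which of the two you are actually using.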
  • 20. Algorithms: Clustering
      Sort documents, emails, objects by text class and group terms/documents into distinct categories. Produce visualization.
      Question: What’s an emerging topic among users?
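
As a rough illustration of clustering, here is a minimal k-means sketch in plain Python (k-means is named on the ML slide; this is not the Spark or Mahout implementation). The points are hypothetical two-term count vectors, and seeding the centroids with the first `k` points is a simplifying assumption made to keep the run deterministic.

```python
import math


def kmeans(points, k, iterations=10):
    """Return (centroids, assignment) after a fixed number of iterations."""
    centroids = [list(p) for p in points[:k]]  # naive seeding (assumption)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its closest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Hypothetical documents described by counts of two terms.
documents = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(documents, k=2)
print(assignment)
```

In practice the seeding is usually randomized or chosen more carefully, and the loop stops when assignments no longer change; both are left out here to keep the sketch short.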
  • 21. Algorithms: Naïve Bayesian Classifier
      Given a training set, sort documents by content: Spam/Not, Religion/Politics/Art, etc.
      Question: Which content “looks like” other content?
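
A minimal sketch of the idea, assuming a multinomial model over whitespace-separated words with add-one (Laplace) smoothing. The training sentences and the function names `train` and `classify` are made up for illustration and do not come from the deck.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count documents per label and words per label from (label, text) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in labeled_docs:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log-probability for `text`."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training set.
model = train([
    ("spam", "win cash now"),
    ("spam", "cheap cash offer"),
    ("not spam", "meeting agenda attached"),
    ("not spam", "lunch meeting tomorrow"),
])
print(classify(model, "cash offer now"))
print(classify(model, "meeting tomorrow"))
```

Working in log-probabilities avoids numeric underflow when documents are long, and the smoothing keeps a single unseen word from forcing a label's probability to zero.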
  • 22. Algorithms: Recommender Systems
        • User-based filtering for cold start (AKA “likes”)
        • Item-based (user similarity) filtering once there is sufficient user data
      Question: If user thinks “A” is useful, how about “B”, “C”? How similar is one user’s pattern to another?
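
The question on this slide can be sketched as a small similarity-weighted recommender in plain Python. The ratings, user names, and the functions `similarity` and `recommend` are hypothetical; cosine similarity over shared items is one common choice and is an assumption here, not something the deck specifies.

```python
import math

# Hypothetical ratings: user -> {item: rating}
RATINGS = {
    "u1": {"A": 5, "B": 4},
    "u2": {"A": 5, "B": 5, "C": 4, "D": 1},
    "u3": {"A": 1, "B": 1, "D": 5},
}


def similarity(u, v):
    """Cosine similarity between two users' rating patterns."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def recommend(ratings, user):
    """Rank items the user has not rated by similarity-weighted average rating."""
    weighted, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                weighted[item] = weighted.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    predictions = [(item, weighted[item] / weights[item]) for item in weighted]
    return sorted(predictions, key=lambda pair: pair[1], reverse=True)


ranked = recommend(RATINGS, "u1")
print(ranked)
```

Here `u1` rates like `u2`, so `u2`'s opinion of "C" and "D" dominates the prediction, while the dissimilar `u3` contributes only a small weight.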
  • 23. Easily Convert between bits/bytes and numbers/words with Avro
        • Serialization
        • Expressive
            • Records, arrays, unions, enums
        • Efficient
            • Compact binary, compressed, splittable
        • Interoperable
            • Langs: C, C++, C#, Java, Perl, Python, Ruby, PHP
            • Tools: MR, Pig, Hive, Crunch, Flume, Sqoop, etc.
        • Dynamic
            • Can read & write w/o generating code first
        • Evolvable
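
To show what "compact binary" means in practice, here is a hand-rolled sketch of Avro's binary encoding for two primitive types. It does not use the `avro` library, and the function names are illustrative. It relies on three rules from the Avro specification: `long` values are zig-zag encoded and then written as a variable-length integer, a `string` is its byte length (as a `long`) followed by UTF-8 bytes, and a record is the concatenation of its fields in schema order. The schema below is the two-field example used in the specification.

```python
# Avro schemas are JSON; this one is written as the equivalent Python dict.
SCHEMA = {
    "type": "record",
    "name": "Test",
    "fields": [
        {"name": "a", "type": "long"},
        {"name": "b", "type": "string"},
    ],
}


def encode_long(value):
    """Zig-zag encode a signed 64-bit integer, then emit base-128 varint bytes."""
    zigzag = (value << 1) ^ (value >> 63)  # small magnitudes become small numbers
    out = bytearray()
    while True:
        byte = zigzag & 0x7F
        zigzag >>= 7
        if zigzag:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(text):
    """Length prefix (as a long) followed by the UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_long(len(data)) + data


ENCODERS = {"long": encode_long, "string": encode_string}


def encode_record(schema, datum):
    """Concatenate field encodings in schema order; no field names are written."""
    return b"".join(
        ENCODERS[field["type"]](datum[field["name"]]) for field in schema["fields"]
    )


encoded = encode_record(SCHEMA, {"a": 27, "b": "foo"})
print(encoded.hex())
```

Because field names and types live in the schema rather than in each record, the five bytes above carry only the values, which is also why a reader needs the writer's schema to decode them.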
  • 24. Query results from large analyses in Impala
        • Brings real-time query capabilities to Hadoop
        • It’s fast! Natively written in C++
        • Same great SQL query language as Hive
  • 25. Analytics to users: HUE
        • Included in EDH
        • Multi-capability interface for analytics
        • Interactive graph libraries
        • Customizable Search, Impala, Hive, Pig Apps
        • But Also: Tableau, Pentaho, Platfora, ZoomData, SAS…
  • 26. Cloudera Manager: End-to-End Administration for CDH
      1. Manage: Easily deploy, configure & optimize clusters
      2. Monitor: Maintain a central view of all activity
      3. Diagnose: Easily identify and resolve issues
      4. Integrate: Use Cloudera Manager with existing tools
  • 27. Thank You!
  • 28. Enterprise Services
      Ingestion & ETL Pilot
        Reference implementation up to 3 sources, 5 transformations, 1 target
        Create, execute, test, and review a custom ingestion/ETL plan
      Security Integration
        Implementation of role based access control with the data processing environment
      Hadoop Cluster Deployment Certification
        Fully review hardware, data sources, typical jobs, and existing SLAs
        Develop, implement, benchmark, and document Hadoop deployment
  • 29. Path to Success – Services & Training
      Hadoop Cluster Deployment Certification: 1 week
      Ingestion & ETL Pilot: 2 weeks
      Security Integration: 1 week
      Cloudera Admin Training: 3 days
      Hive/Pig Training: 2 days
      Data Science: 3 days
      Developer Training: 4 days
  • 30. Nominations are open for the Data Impact Awards!
        • Winners will receive:
            • Free Strata + Hadoop World pass
            • Free seat to any public Cloudera University Training
            • Invitation to exclusive awards dinner
            • Bragging rights
      Submission deadline: September 12th
      ©2014 Cloudera, Inc. All rights reserved.