
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data


Published on

Presented on August 6, 2014

Published in: Technology

Transcript of "Cloudera Breakfast Series, Analytics Part 1: Use All Your Data"

  1. Advanced Analytics Part I: Use All Your Data. DC: Rob Morrow, Senior Systems Engineer. MD: Chris Bove, Senior Systems Engineer. August 6
  2. From BI to Advanced Analytics. What happened, where, and when? What will happen? How and why did it happen? How can we do better? (Chart labels: Time; Data Size; Facts; Interpretations)
  3. Traditional Analytics Process. Operationalize Model; In-Database Model Scoring; Data Cleansing & Processing; Data Extraction; Data Exploration & Discovery; In-Memory Model Development. Time-to-Insight
  4. Accessing & Sharing the Data is Difficult. DW; External; Multi-structured; Structured
  5. “Are we there yet?” (1) Find the data. (2) Get access to data. (3) Learn about the data. (4) Move sample data to ADW. (5) Analysis. Finally! (6) Operationalize the model. Data Discovery: 6-9 Months
  6. Silo’d Platforms Challenge Collaboration. Departmental Warehouse; Non-Agile Models; Enterprise Apps; Reporting; Prioritized Operational Processes; Departmental Warehouse; Silo’d Analytics; Static schemas accrete over time; Data Sources; Silo’d Analytics
  7. (1) Find the data. (2) Get access to data. (3) Learn about the data. (4) Move to ADW. (5) Analysis. Finally! (6) Operationalize the model. 6-9 Months. Ouch!
     •  Users & Business Influencers (Data Scientist, Business Analysts): “I’m sick of waiting for my data, I’m going to make my own copy.”
     •  Technical Influencers (DBA/DW Admins): “I need to get those data scientists the data they want, or else they will stand up another data mart, I will have to manage it sooner or later.”
     •  Executives (Executive Sponsors, LOB Manager (PM, Director, R&D, etc.)): “We don’t have the information we need to answer key business questions.”
  8. Solution: Cloudera EDH. Unified Scale-out Storage For Any Type of Data. Elastic, Fault-tolerant, Self-healing, In-memory capabilities. Resource Management. Batch Processing; Analytic MPP DBMS; Search Engine; Online NoSQL DBMS; Stream Processing; Machine Learning; SQL; Streaming; File System (NFS). Data Management; System Management; Metadata, Security, Audit, Lineage. Training & Services. Callouts: Search; Faster data discovery; Navigator; Multiple tools on one platform; Impala; Spark; Hadoop MapReduce; Use all data with centralized mgmt & security; Metadata, Security; Cloudera Manager; Training & Services; Operationalize Models; Flume / Spark Streaming; HBase
  9. Solution: Cloudera EDH. Enterprise Data Hub. Unified Scale-out Storage For Any Type of Data. Elastic, Fault-tolerant, Self-healing, In-memory capabilities. Resource Management. Batch Processing; Analytic MPP DBMS; Search Engine; Online NoSQL DBMS; Stream Processing; Machine Learning; SQL; Streaming; File System (NFS). Data Management; System Management; Metadata, Security, Audit, Lineage. Training & Services
  10. Analytics with EDH. Operationalize Model; In-Database Model Scoring; Data Cleansing & Processing; Data Exploration & Discovery; In-Memory Model Development; Time-to-Insight. Data Exploration & Discovery; Data Cleansing & Processing; Operationalize Model; Data Extraction; In-Platform Model Dev & Scoring; Deliver Insight Sooner
  11. Solution Benefits
     Use all your data:
     •  Use 100x more data, and more types of data, with existing tools
     •  Reduce sampling and increase model accuracy and precision
     •  Centralize information security, metadata, management, and governance
     Shorten analytics lifecycle:
     •  Compress the cycle time from data to insights
     •  Facilitate data discovery with real-time SQL and Search
     •  Track data life-cycle in place
     •  Define, test, deploy, and update models all within the EDH
     Do more with data:
     •  Deliver multi-genre analytics in a single platform
     •  Apply diverse concurrent analytics to your full datasets in-place
     •  Protect existing technology and skillset investments
  12. Business Value Delivered. Buyers:
     Users & Business Influencers (Data Scientist, Business Analysts): “I’m sick of waiting for my data, I’m going to make my own copy.”
     •  Acquire data necessary for projects
     •  Develop analysis/models with better fit faster
     •  Share data sets to empower others
     Technical Influencers (DBA/DW Admins): “I need to get those data scientists the data they want, or else they will stand up another data mart, which I will have to manage sooner or later.”
     •  Spend less time and money reconciling shadow IT environments
     •  Shared security, metadata, management, and governance
     Executives (Executive Sponsors, LOB Manager (Marketing, Sales, R&D, etc.)): “We don’t have the information we need to answer key business questions.”
     •  Acquire necessary information sooner to make critical business decisions
  13. Data Access: Stores and Connectors. Thrift; pdf/Word/txt; csv. CONNECTORS: ORACLE; NETEZZA; ODBC/JDBC; TERADATA; MongoDB; Splunk/Hunk; MICROSTRATEGY; IMPALA; HBASE; SOLR; SPARK; ACCUMULO; ZoomData; Hive; Sqoop; Flume; Partner Native Connectors; Revolution R; Parquet; Sequence; JSON; Binary; SkyTree; Avro
  14. Historical Archive: Tape vs Data
     Aggregate Data Value:
     •  Direct access to data has value, Data Stored offsite/offline has cost
     •  A single 8k record may have nearly zero value, but 10,000? 10,000,000?
     •  What is Business value of testing the predictive power of current data?
     Data Availability:
     •  Assuming locality, Is 110MB per drive fast enough?
     •  Certainly not fast enough to be included in any current analytics.
     •  Striping across tape drives is Science-Fiction. Complex Tiers, anyone?
     •  I/O IS the problem. Not CPU.
     Data Volume/Cost:
     •  Everyone practices Backups. How about Restores? Full site restores?
     •  Can’t we just more aggressively compress online data?
     •  “Tape is cheap”. It had better be, because the data isn’t easily usable.
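The "is 110MB per drive fast enough?" question on the tape slide can be made concrete with a little arithmetic. The sketch below assumes the figure means a sustained 110 MB/s per drive (the slide does not state the unit of time) and uses decimal terabytes; the function name and the dataset sizes are hypothetical.

```python
def hours_to_read(dataset_tb: float,
                  mb_per_sec_per_drive: float = 110.0,
                  drives: int = 1) -> float:
    """Hours needed to stream a dataset once, assuming perfect parallelism.

    Assumption: 110 MB/s sustained per drive, 1 TB = 1,000,000 MB.
    """
    total_mb = dataset_tb * 1_000_000
    seconds = total_mb / (mb_per_sec_per_drive * drives)
    return seconds / 3600.0


if __name__ == "__main__":
    for tb in (1, 10, 100):
        print(f"{tb:>4} TB on 1 drive: {hours_to_read(tb):8.1f} h")
    print(f" 100 TB on 100 drives: {hours_to_read(100, drives=100):.1f} h")
```

Under these assumptions a single drive needs more than ten days to read 100 TB once, which is the point of "I/O IS the problem. Not CPU."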
  15. Spark Streaming: What is it? Spark Streaming processes data in micro-batches: Resilient Distributed Datasets (RDD). Consistent with HDFS Architectural Principles. Processing individual records creates inconsistencies (simultaneous writes), AKA Storm.
  16. What can you do with it?: Stream It. Streaming “Windows” allows time-sliced atomic updates to Analytics. Discretized Stream (DStream): Sequence of RDD’s arranged as lines/words. Window: Sequence of DStreams time-arranged as windows
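The micro-batch and window ideas on the two streaming slides can be illustrated without a cluster. The following is a plain-Python sketch of the concept only, not the Spark Streaming API: `micro_batches` and `windowed_counts` are hypothetical helper names, and each inner list stands in for one RDD of a DStream.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def micro_batches(events: Iterable[Tuple[float, str]],
                  batch_seconds: float) -> List[List[str]]:
    """Group (timestamp, word) events into consecutive fixed-length batches."""
    ordered = sorted(events)
    if not ordered:
        return []
    start = ordered[0][0]
    n_batches = int((ordered[-1][0] - start) // batch_seconds) + 1
    batches: List[List[str]] = [[] for _ in range(n_batches)]
    for ts, word in ordered:
        batches[int((ts - start) // batch_seconds)].append(word)
    return batches


def windowed_counts(batches: List[List[str]],
                    window: int, slide: int) -> List[Dict[str, int]]:
    """Word counts over sliding windows of `window` batches, moving `slide` batches."""
    results = []
    for end in range(window, len(batches) + 1, slide):
        counts: Counter = Counter()
        for batch in batches[end - window:end]:
            counts.update(batch)
        results.append(dict(counts))
    return results


if __name__ == "__main__":
    events = [(0.0, "a"), (0.5, "b"), (1.2, "a"), (2.1, "c"), (3.4, "a")]
    batches = micro_batches(events, batch_seconds=1.0)
    print(batches)
    print(windowed_counts(batches, window=2, slide=1))
```

Each window result is computed from whole batches, which is what lets the analytics be updated one time slice at a time rather than record by record.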
  17. What can you do with it?: ML. “What about Machine Learning?” Spark-ML: Same Input format and algorithms as Mahout. Uses Resilient Distributed DataSets In-Memory. Useful for:
     •  Clustering (k-Means, etc)
     •  Classification (email, sentiment)
     •  Recommenders (ratings correlation)
     •  Dimensionality Reduction (PCA, SVD)
  18. Model Effectiveness and Sampling
     •  Some Statisticians (medical) find it hard to turn the corner on the sampling topic:
     •  ANOVA vs Multiple Regression. Same tests**, one’s a vector without the Power problems
     •  Algorithm choice should be related to, not restricted by, data volume.
     •  Best approach = simple algorithm, lots of data
     •  Sampling should still be used, but to test model effectiveness. Not to fix IT.
     **Source: Applied Multiple Regression/Correlation Analysis (Cohen & Cohen, 1983)
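One common way to use sampling "to test model effectiveness" is a holdout split: train on most of the data and score on a random sample that the model never saw. The sketch below is only an illustration of that idea; `holdout_split`, the 20% fraction, and the seed are assumptions, not something the slides prescribe.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def holdout_split(rows: Sequence[T], test_fraction: float = 0.2,
                  seed: int = 42) -> Tuple[List[T], List[T]]:
    """Shuffle reproducibly, then hold out a random sample for evaluation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    train, test = holdout_split(range(100))
    print(len(train), len(test))
```

The model is fitted on the full `train` portion, so the sample limits only the evaluation, not the amount of data the algorithm gets to learn from.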
  19. Which dataset offers better predictive power? Remember, this is not testing for an effect…
     Dataset 1 (sparse; most cells are empty):
       Uses work computer for shopping: 1, 4, 5, 1
       Moves data between networks: 4, 5, 2
       Works long hours: 4, 3, 3
       System/Network admin privs: 5
     Dataset 2 (complete):
                                          Alice  Bob  Chuck  Donna  Eddie  Frank  Gina
       Uses work computer for shopping      1     4     2      4      5      1     3
       Moves data between networks          4     3     1      5      1      4     3
       Works long hours                     2     4     3      3      4      3     2
       System/Network admin privs           1     2     1      5      3      5     4
     1.  As we add dimensions, average distance increases. Add Data.
     2.  Fewer “neighbors” within a certain radius of any given point when the dataset is smaller. Add Data.
     3.  Are you looking at similarity (r/cosine) or are you using dissimilarity (Euclidean)?
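Points 1 and 3 on this slide can be checked numerically. The sketch below assumes the complete table is read column by column, so each user becomes a four-value vector (shopping, data movement, long hours, admin privileges); the helper names are hypothetical.

```python
import math
from itertools import combinations

# One vector per user, taken from the complete table on the slide.
users = {
    "Alice": (1, 4, 2, 1), "Bob": (4, 3, 4, 2), "Chuck": (2, 1, 3, 1),
    "Donna": (4, 5, 3, 5), "Eddie": (5, 1, 4, 3), "Frank": (1, 4, 3, 5),
    "Gina": (3, 3, 2, 4),
}


def euclidean(a, b):
    """Dissimilarity: straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Similarity: cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_distance(vectors, dims):
    """Average Euclidean distance using only the first `dims` features."""
    pairs = list(combinations(vectors, 2))
    return sum(euclidean(a[:dims], b[:dims]) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    vectors = list(users.values())
    for d in range(1, 5):
        print(d, "dimensions:", round(mean_pairwise_distance(vectors, d), 3))
    print("Alice vs Frank, Euclidean:", round(euclidean(users["Alice"], users["Frank"]), 3))
    print("Alice vs Frank, cosine:", round(cosine(users["Alice"], users["Frank"]), 3))
```

Adding a feature can only add a non-negative term under the square root, so the average distance never shrinks as dimensions are added; more rows are what keep neighbours within reach.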
  20. Algorithms: Clustering. Sort documents, emails, objects by text class and group terms/documents into distinct categories. Produce visualization. Question: What’s an emerging topic among users?
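As a rough illustration of clustering, here is a minimal k-means sketch in plain Python. It assumes documents have already been turned into numeric vectors (for example term counts), which the slide does not cover; the two-dimensional points and the deterministic "first k points" initialisation are simplifications for the example.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points seed the centroids (an assumption)."""
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


if __name__ == "__main__":
    pts = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
    centroids, clusters = kmeans(pts, k=2)
    print(centroids)
    print(clusters)
```

Real text clustering would work on much higher-dimensional term vectors and would usually pick initial centroids more carefully.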
  21. Algorithms: Naïve Bayesian Classifier. Given a training set, sort documents by content: Spam/Not, Religion/Politics/Art, etc. Question: Which content “looks like” other content?
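A Naïve Bayes text classifier is small enough to sketch directly. The version below is a multinomial model with add-one (Laplace) smoothing in plain Python; the four training documents and the labels "spam" and "ham" are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: iterable of (label, text). Returns (word counts, label counts, vocabulary)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for label, text in docs:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest log-posterior for `text`."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the product.
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train([
        ("spam", "win cash prize now"),
        ("spam", "cheap prize win"),
        ("ham", "meeting agenda for monday"),
        ("ham", "project status meeting"),
    ])
    print(classify("win a cheap prize", model))
    print(classify("monday project meeting", model))
```

The same structure extends to more than two classes, such as the Religion/Politics/Art example on the slide, by adding labelled training documents.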
  22. Algorithms: Recommender Systems
     •  User-based filtering for cold start (AKA “likes”)
     •  Item-based (user similarity) filtering once there is sufficient user data
     Question: If user thinks “A” is useful, how about “B”, “C”? How similar is one user’s pattern to another?
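The question "how similar is one user's pattern to another?" can be sketched as similarity-weighted rating prediction. The example below uses cosine similarity over co-rated items; the ratings, user names, and function names are hypothetical, and production recommenders typically also mean-centre ratings and handle sparsity more carefully.

```python
import math

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"A": 5, "B": 1, "C": 1},
    "carol": {"A": 1, "B": 5, "C": 5},
    "bob": {"A": 5, "B": 1},  # has not rated "C" yet
}


def user_similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    shared = sorted(set(ratings[u]) & set(ratings[v]))
    if not shared:
        return 0.0
    a = [ratings[u][i] for i in shared]
    b = [ratings[v][i] for i in shared]
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def predict(user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    weighted, total = 0.0, 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = user_similarity(user, other)
        weighted += sim * their[item]
        total += sim
    return weighted / total if total else None


if __name__ == "__main__":
    print("bob ~ alice:", round(user_similarity("bob", "alice"), 3))
    print("bob ~ carol:", round(user_similarity("bob", "carol"), 3))
    print("predicted rating of C for bob:", round(predict("bob", "C"), 3))
```

Because bob's pattern matches alice's far more closely than carol's, the prediction for "C" is pulled toward alice's low rating.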
  23. Easily Convert between bits/bytes and numbers/words with Avro
     •  Serialization
     •  Expressive
        •  Records, arrays, unions, enums
     •  Efficient
        •  Compact binary, compressed, splittable
     •  Interoperable
        •  Langs: C, C++, C#, Java, Perl, Python, Ruby, PHP
        •  Tools: MR, Pig, Hive, Crunch, Flume, Sqoop, etc
     •  Dynamic
        •  Can read & write w/o generating code first
     •  Evolvable
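Avro schemas are themselves JSON documents, so the "records, arrays, unions, enums" bullet can be shown with a plain dictionary. The record below (`UserEvent` and all of its fields) is hypothetical, and the sketch only builds and prints the schema; it does not call any Avro library.

```python
import json

# Hypothetical schema exercising each type named on the slide.
user_event_schema = {
    "type": "record",
    "name": "UserEvent",
    "namespace": "example.analytics",  # assumed namespace
    "fields": [
        {"name": "user", "type": "string"},
        {"name": "action", "type": {
            "type": "enum", "name": "Action",
            "symbols": ["SHOPPING", "DATA_MOVE", "LOGIN"]}},
        {"name": "tags", "type": {"type": "array", "items": "string"}},
        # A union with null plus a default makes the field optional, which
        # is what lets a schema evolve without breaking older readers.
        {"name": "network", "type": ["null", "string"], "default": None},
    ],
}

if __name__ == "__main__":
    print(json.dumps(user_event_schema, indent=2))
```

Because the schema travels with the data as JSON, readers in any of the listed languages can decode the records without generating code first.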
  24. Query results from large analyses in Impala
     •  Brings real-time query capabilities to Hadoop
     •  It’s fast! Natively written in C++
     •  Same great SQL query language as Hive
  25. Analytics to users: HUE
     •  Included in EDH
     •  Multi-capability interface for analytics
     •  Interactive graph libraries
     •  Customizable Search, Impala, Hive, Pig Apps
     •  But Also: Tableau, Pentaho, Platfora, ZoomData, SAS…
  26. Cloudera Manager: End-to-End Administration for CDH
     (1) Manage: Easily deploy, configure & optimize clusters
     (2) Monitor: Maintain a central view of all activity
     (3) Diagnose: Easily identify and resolve issues
     (4) Integrate: Use Cloudera Manager with existing tools
  27. Thank You!
  28. Enterprise Services
     •  Ingestion & ETL Pilot: Reference implementation up to 3 sources, 5 transformations, 1 target. Create, execute, test, and review a custom ingestion/ETL plan
     •  Security Integration: Implementation of role based access control with the data processing environment
     •  Hadoop Cluster Deployment Certification: Fully review hardware, data sources, typical jobs, and existing SLAs. Develop, implement, benchmark, and document Hadoop deployment
  29. Path to Success – Services & Training
     •  Hadoop Cluster Deployment Certification: 1 week
     •  Ingestion & ETL Pilot: 2 weeks
     •  Security Integration: 1 week
     •  Cloudera Admin Training: 3 days
     •  Hive/Pig Training: 2 days
     •  Data Science: 3 days
     •  Developer Training: 4 days
  30. Nominations are open for the Data Impact Awards! Submission deadline: September 12th
     •  Winners will receive:
        •  Free Strata + Hadoop World pass
        •  Free seat to any public Cloudera University Training
        •  Invitation to exclusive awards dinner
        •  Bragging rights
     ©2014 Cloudera, Inc. All rights reserved.
