Keynote on 2015 Yale Day of Data

Big Data & Analytics: Five Trends and Five Research Challenges

Published in: Data & Analytics

  1. 1. Big Data & Analytics: Five Trends and Five Research Challenges. Robert Grossman, University of Chicago & Open Data Group. September 18, 2015.
  2. 2. Part 1: What is Big Data? "Researchers and policymakers are beginning to realize the potential for channeling these torrents of data into actionable information that can be used to identify needs & provide services for the benefit of low-income populations." Source: Big Data, Big Impact: New Possibilities for International Development, World Economic Forum, 2012.
  3. 3. • Volume • Velocity • Variety • Veracity • Value • Megabytes • Gigabytes • Terabytes • Petabytes • Exabytes • Zettabytes
  4. 4. The Name Changes • 1830: statistics • 1980: computationally intensive statistics • 1993: data mining & knowledge discovery in databases • 1997: business analytics • 2004: predictive analytics • 2011: big data, data science & data analytics. Source: Google Trends, www.google.com/trends
  5. 5. What is Big Data? (Operations POV) A marketing term introduced by O'Reilly: "Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it." Edd Dumbill, What is Big Data?, strata.oreilly.com, January 11, 2012.
  6. 6. What is Big Data? (POV: New Types of Data that IT Cannot Manage) Columns: Period | New types of data | Term used • 1990s | Clicks on the Internet, POS transactions | Data mining • 2000s | Unstructured data, graph data | Predictive analytics • 2010s | Mobile data, IoT data | Big data
  7. 7. What Is Small Data? • 100 million movie ratings • 480 thousand customers • 17,000 movies • From 1998 to 2005 • Less than 2 GB data. • Fits into memory, but very sophisticated models required to win.
  8. 8. What are the origins of big data?
  9. 9. Basic Choice with Hardware: Scale Up or Out • More memory, more processors, more disk ($K) • Specialized hardware (e.g. connects) ($100K) • Specialized devices ($M) • One machine • Cluster (racks) ($100K) • Cyber pod ($M) • Distributed cyber pods ($10M+)
  10. 10. Computational advertising finds the "best match" between a given user in a given context and a suitable advertisement ($100+ B market). Source: Interior of one of Google's data centers, www.google.com/about/datacenters/
  11. 11. The Google Data Stack • The Google File System (2003) • MapReduce: Simplified Data Processing… (2004) • BigTable: A Distributed Storage System… (2006)
  12. 12. Source: Terence Kawaja, http://www.slideshare.net/tkawaja
  13. 13. What is Big Data? (My computer is a data center POV) • The leaders in big data analytics measure data in megawatts. – As in, Facebook's leased data centers are typically between 2.5 MW and 6.0 MW. – Facebook's new Prineville data center is 30 MW.
  14. 14. Part 2: What is Analytics? Source: Aaron Parecki, Everywhere I've Been, aaronparecki.com.
  15. 15. What is Analytics? Short Definition • Using data to make decisions. Longer Definition • Using data to take actions and make decisions using models that are statistically valid and empirically derived. Definition of Statistics from ASA web page: • Statistics is the science of learning from data, and of measuring, controlling, and communicating uncertainty … Source: American Statistical Association, www.amstat.org/careers/whatisstatistics.cfm, from: Davidian, M. and Louis, T. A., 10.1126/science.1218685.
  16. 16. Timeline figure: 1984, Computationally Intensive Statistics • 1993, Data Mining & KDD • 2004, Predictive Analytics • 2011, Big Data & Data Science. Other labels on the figure: PageRank • Spanner TX algorithm • ID3 & C4.5 • Devices/IoT • Internet • POS • Direct marketing
  17. 17. 1. Given n planes A_1, …, A_n. Assume each plane A_i has b_ij bullet holes in the tail, wing, fuselage and other (j = 1, 2, 3, 4, respectively). 2. Compute where to put additional armor to maximize the chance that planes return.
  18. 18. Part 3. Data Science
  19. 19. Some fields have a single instrument costing a billion dollars or more that generates big data. A picture of CERN's Large Hadron Collider (LHC). The LHC took about a decade to construct, and cost about $4.75 billion. Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732
  20. 20. Some fields have hundreds or thousands of million dollar instruments that in aggregate produce big data. A genomics sequencing facility might have 3-5 next generation sequencing instruments that cost $250,000 or more each.
  21. 21. Some fields have millions of hundred dollar sensors that in aggregate produce big data.
  22. 22. Math & Statistics • Computer Science • Disciplinary Science • Data Science
  23. 23. Understanding Salmon (A Cautionary Tale) Source: Salmo salar (Atlantic Salmon), wikipedia.org
  24. 24. Methods. Subject. One mature Atlantic Salmon (Salmo salar) participated in the fMRI study. The salmon was approximately 18 inches long, weighed 3.8 lbs, and was not alive at the time of scanning. Task. The task administered to the salmon involved completing an open-ended mentalizing task. The salmon was shown a series of photographs depicting human individuals in social situations with a specified emotional valence. The salmon was asked to determine what emotion the individual in the photo must have been experiencing. Design. Stimuli were presented in a block design with each photo presented for 10 seconds followed by 12 seconds of rest. A total of 15 photos were displayed. Total scan time was 5.5 minutes.
  25. 25. Several active voxels were discovered in a cluster located within the salmon's brain cavity (Figure 1, see above). The size of this cluster was 81 mm³ with a cluster-level significance of p = 0.001. Due to the coarse resolution of the echo-planar image acquisition and the relatively small size of the salmon brain further discrimination between brain regions could not be completed. Out of a search volume of 8064 voxels a total of 16 voxels were significant.
  26. 26. The bigger the data, the easier it is to do stupid things with it, such as forgetting to correct for multiple tests.
  27. 27. Part 4. What Instrument Do We Use to Make Discoveries in Data Science? How do we build a "datascope"?
  28. 28. experimental science: 1609, 30x; 1670, 250x • simulation science: 1976, 10x-100x • data science
  29. 29. experimental science: 1609, 30x; 1670, 250x • simulation science: 1976, 10x-100x • data science: 2004, 10x-100x, "Cyberpod"
  30. 30. Could we continuously re-analyze the world's cancer data?
  31. 31. Complex statistical models over small data that are highly manual and updated infrequently. Simpler statistical models over large data that are highly automated and updated frequently. Figure labels: memory • databases • datapods • cyber pods; GB • TB • PB; W • KW • MW
  32. 32. Part 5: Five Trends. Source: Google Trends, for term "data commons", www.google.com/trends.
  33. 33. Trend 1: Data Commons. Source: NEXRAD, NOAA, www.noaa.org
  34. 34. The Standard Model of Biomedical Computing No Longer Works • Public data repositories • Network download • Private local storage & compute • Local data ($1K) • Community software • Software, sweat and tears ($100K)
  35. 35. Data Commons. Data commons co-locate data, storage and computing infrastructure, and commonly used tools for analyzing and sharing data to create a resource for the research community. Source: Interior of one of Google's data centers, www.google.com/about/datacenters/
  36. 36. • Open Science Data Cloud (Open Cloud Consortium, 2012) • NCI Data Commons (UChicago, Nov 2015) • Bionimbus Protected Data Cloud (UChicago, 2013) • NOAA Data Commons (Open Cloud Consortium, Oct 2015)
  37. 37. Purple balls are lung adenocarcinoma. Grey are lung squamous cell carcinoma. Green are misdiagnosed.
  38. 38. • Hospitals, medical research centers and doctors • Patients • Data commons containing genomic and clinical data. • Output: continuously updated, data-driven, analytics-informed discovery, diagnosis and treatment.
  39. 39. Trend 2: Analytics of Things, People and Places. Source: Urban sensor on street pole in Chicago (conceptual), arrayofthings.github.io/
  40. 40. People and things generating streaming data that are relevant for research.
  41. 41. Places that generate data. Source: Jane Macfarlane, Here, a Division of Nokia.
  42. 42. Trend 3: Languages for Data, Statistical Models, Data Science Workflows & Exploratory Data Analysis. Source: M. Bostock, http://bl.ocks.org/mbostock/4063318
  43. 43. • Portable Format for Analytics (PFA) • Predictive Model Markup Language (PMML) • Grammar of Graphics • d3.js
  44. 44. Trend 4: More Policies That Make Data Available and Analytics Repeatable
  45. 45. Executive Order 13642 (May 9, 2013): Making Open and Machine Readable the Default for Government Information ("Open Data Policy") • OMB Guidance • President's Executive Order
  46. 46. Trend 5: Translational Data Science. How do we translate data driven discoveries into actions that impact society?
  47. 47. Translational Informatics • Imaging Informatics • Clinical Informatics • Bioinformatics • Public Health Informatics • Basic Research • Applied Research • Practice (dx, treatment and prevention) • Molecular & cellular processes • Tissues & organs • Individuals (patients) • Groups & populations • Quality & outcomes
  48. 48. Translational Data Science • New algorithms, new statistical models (data science) • Applications to genomics, analysis of EMR, etc. • Software stacks for data intensive computing (data engineering) • Data driven discoveries • Data driven diagnosis • Data driven therapeutics. Develop software stack that scales to a "datapod", to create "commons" for data driven discoveries, dx & treatment. (Core strategy for Center for Data Intensive Science, University of Chicago)
  49. 49. Source: Maria T. Patterson and Robert L. Grossman, Detecting localized spatial patterns of disease incidence using a neighbor-based bootstrapping method on electronic medical records data from 99.1 million patients, to appear.
  50. 50. Part 6: Five Challenges
  51. 51. Challenge 1. Is More Different? Do New Phenomena Emerge at Scale in Data? Source: P. W. Anderson, More is Different, Science, Volume 177, Number 4047, 4 August 1972, pages 393-396.
  52. 52. Challenge 2. One Million Genomes • Sequencing a million genomes would likely change the way we understand genomic variation and provide a foundation for precision medicine. • The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue). • One million genomes is about 1000 PB or 1 EB • With compression, it may be about 100 PB • At $1000/genome, the sequencing would cost about $1B • Think of this as one hundred studies with 10,000 patients each over three years.
  53. 53. Challenge 3. Datapods • Databases have fundamentally changed the way we manage and analyze scientific data. • NoSQL databases allow us to scale out to multiple racks of computers, but are hard to operate. • If our scientific instrument for data science is a cyberpod of hardware and a software stack supporting data analysis, we need a simple-to-manage, open source "database" that scales to a cyberpod. • Call this a "datapod." • It could support open source data commons and allow them to peer.
  54. 54. Challenge 4. A Billion Predictive Models • Develop technology to automatically generate 1 to 10 billion heterogeneous segmented models • Applications – George Church's challenge: individual predictive models for each human genome (6.5 billion humans). – 1 million cancer genomes x 1,000 models / genome. – Urban science: instrumenting cities. – Consumer marketing: large advertisers will see 1-3 billion different consumers
  55. 55. Challenge 5. HDSI • Human Computer Interaction (HCI) was an important field before everyone got a computer and became an expert. • Think of Human Data Science Interaction (HDSI) as how humans interact with the software supporting the analysis of data science at the scale of datapods with billions of models and trillions of hypotheses. • How can we improve the interaction to improve how we semi-automatically integrate data, validate hypotheses, interactively explore data, etc.?
  56. 56. Questions? rgrossman.com @bobgrossman
  57. 57. For More Information • cdis.uchicago.edu • www.opendatagroup.com • rgrossman.com
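
Slide 17 asks where to add armor given the bullet holes counted on planes that returned. The toy simulation below is only a sketch of why raw hole counts can mislead (survivorship bias); it is not the estimation method used historically, and nothing in it comes from the slides beyond the four section names. The per-section lethality values, the one-hit-per-plane rule, and the function name simulate are illustrative assumptions.

```python
import random

SECTIONS = ["tail", "wing", "fuselage", "other"]

# Hypothetical probability that a single hit in a section downs the plane.
# These numbers are assumptions chosen for illustration only.
LETHALITY = {"tail": 0.1, "wing": 0.1, "fuselage": 0.2, "other": 0.6}


def simulate(n_planes=10_000, seed=0):
    """Count holes per section, looking only at planes that returned."""
    rng = random.Random(seed)
    holes_on_returned = {section: 0 for section in SECTIONS}
    for _ in range(n_planes):
        # Assumption: each plane takes exactly one hit, uniform over sections.
        section = rng.choice(SECTIONS)
        survived = rng.random() >= LETHALITY[section]
        if survived:
            holes_on_returned[section] += 1
    return holes_on_returned


observed = simulate()
fewest = min(observed, key=observed.get)
print("holes seen on returned planes:", observed)
print("section with the fewest observed holes:", fewest)
```

In this toy setup every section is hit equally often, so the section with the fewest holes among returning planes is the one where hits are most often fatal. That is the section where extra armor helps most, which is the opposite of what the raw counts suggest.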

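Slides 25 and 26 make the point that testing thousands of voxels without a correction produces "significant" results by chance. The sketch below works through that arithmetic for the 8064-voxel search volume quoted on slide 25. The per-voxel threshold of 0.001, the family-wise level of 0.05, and the use of independent uniform p-values as a stand-in for pure noise are assumptions made for illustration; real fMRI noise is spatially correlated. Bonferroni and Benjamini-Hochberg are shown as two standard corrections, not as the method used in the salmon study.

```python
import random


def bonferroni(p_values, alpha=0.05):
    """Indices of tests that stay significant after Bonferroni correction."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of tests rejected by the Benjamini-Hochberg FDR procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its step-up threshold.
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


N_VOXELS = 8064            # search volume from slide 25
UNCORRECTED_ALPHA = 0.001  # assumed per-voxel threshold

# With no real signal, each voxel still passes with probability alpha.
expected_false_positives = N_VOXELS * UNCORRECTED_ALPHA

# Simulated null data: uniform p-values, i.e. pure noise (an assumption).
rng = random.Random(0)
null_p = [rng.random() for _ in range(N_VOXELS)]
uncorrected_hits = [i for i, p in enumerate(null_p) if p < UNCORRECTED_ALPHA]

print("expected false positives, uncorrected:", expected_false_positives)
print("uncorrected hits in simulated noise:", len(uncorrected_hits))
print("hits after Bonferroni:", len(bonferroni(null_p)))
print("hits after Benjamini-Hochberg:", len(benjamini_hochberg(null_p)))
```

Under these assumptions about eight voxels are expected to pass the uncorrected threshold in pure noise, which is the same order of magnitude as the 16 significant voxels reported on slide 25.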
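The figures on slide 52 (Challenge 2, One Million Genomes) follow from a few multiplications, reproduced below as a quick check. The sketch assumes decimal units (1 PB = 1000 TB, 1 EB = 1000 PB) and a 10x compression ratio, which is the ratio implied by going from 1000 PB to 100 PB; the slide states neither explicitly. All variable names are illustrative.

```python
# Inputs taken from slide 52.
TB_PER_PATIENT = 1          # tumor plus normal tissue
N_GENOMES = 1_000_000
COST_PER_GENOME_USD = 1_000
PATIENTS_PER_STUDY = 10_000

# Assumptions not stated on the slide.
TB_PER_PB = 1_000           # decimal units
PB_PER_EB = 1_000
COMPRESSION_RATIO = 10      # implied by 1000 PB shrinking to about 100 PB

raw_pb = N_GENOMES * TB_PER_PATIENT / TB_PER_PB
raw_eb = raw_pb / PB_PER_EB
compressed_pb = raw_pb / COMPRESSION_RATIO
total_cost_usd = N_GENOMES * COST_PER_GENOME_USD
n_studies = N_GENOMES // PATIENTS_PER_STUDY

print(f"raw storage: {raw_pb:.0f} PB ({raw_eb:.0f} EB)")
print(f"compressed storage: {compressed_pb:.0f} PB")
print(f"sequencing cost: ${total_cost_usd:,}")
print(f"equivalent studies of {PATIENTS_PER_STUDY:,} patients: {n_studies}")
```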