Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)


This is Chapter 1 of a tutorial that I gave at SC 11 on November 14, 2011.

Published in: Technology


  • 1. An Introduction to Data Intensive Computing. Chapter 1: Introduction. Robert Grossman, University of Chicago and Open Data Group. Collin Bennett, Open Data Group. November 14, 2011
  • 2. Outline
      1. Introduction (0830-0900)
         a. Data clouds (e.g. Hadoop)
         b. Utility clouds (e.g. Amazon)
      2. Managing Big Data (0900-0945)
         a. Databases
         b. Distributed File Systems (e.g. Hadoop)
         c. NoSQL databases (e.g. HBase)
      3. Processing Big Data (0945-1000 and 1030-1100)
         a. Multiple Virtual Machines & Message Queues
         b. MapReduce
         c. Streams over distributed file systems
      4. Lab using Amazon’s Elastic MapReduce (1100-1200)
  • 3. Our perspective is to consider data intensive computing from the viewpoint of utility and data clouds. For the most current version of these notes, please see:
  • 4. Section 1.1 Data Intensive Science. Two of the 14 high throughput sequencers at the Ontario Institute for Cancer Research (OICR).
  • 5. Moore’s law also applies to the instruments that are producing data. This is creating new paradigms: “data intensive science” and “data intensive computing.”
  • 6. Source: Lincoln Stein
  • 7. Data is Big If It is Measured in MW
      • Data is big if you measure it in megawatts.
      • As in, a good sweet spot for a data center is 15 MW.
      • As in, Facebook’s leased data centers are typically between 2.5 MW and 6.0 MW.
      • Facebook’s new Prineville data center is 30 MW.
      • Google’s computing infrastructure uses 260 MW.
  • 8. Some Big Data Sciences

      Discipline          Duration     Size             # Devices
      HEP - LHC           10 years     15 PB/year*      One
      Astronomy - LSST    10 years     12 PB/year**     One
      Genomics - NGS      2-4 years    0.4 TB/genome    1000’s

      *At full capacity, the Large Hadron Collider (LHC), the world’s largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://-en.html
      **As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://-1004.html
  • 9. An algorithm and computing infrastructure is “big-data scalable” if adding a rack of data (and corresponding processors) does not increase the time required to complete the computation but increases the amount of data that can be processed. Add capacity with constant time (ACCT).
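The ACCT property on slide 9 can be illustrated with a small idealized model. This is only a sketch under stated assumptions: each rack holds the same amount of data and processes only its own local data at a fixed rate, with no coordination overhead. The function names and the numbers (100 TB per rack, 25 TB per hour per rack) are hypothetical and do not come from the slides.

```python
def data_processed_tb(racks: int, tb_per_rack: float) -> float:
    """Total data held (and processed) when every rack stores tb_per_rack."""
    return racks * tb_per_rack


def completion_time_hours(racks: int, tb_per_rack: float,
                          tb_per_hour_per_rack: float) -> float:
    """Idealized completion time: total data divided by aggregate throughput.

    Assumes each rack works only on its own local data, so aggregate
    throughput grows in step with the number of racks.
    """
    total_data = data_processed_tb(racks, tb_per_rack)
    total_throughput = racks * tb_per_hour_per_rack
    return total_data / total_throughput


if __name__ == "__main__":
    # Hypothetical figures, chosen only for illustration.
    for racks in (1, 2, 4, 8):
        hours = completion_time_hours(racks, 100.0, 25.0)
        data = data_processed_tb(racks, 100.0)
        print(f"{racks} rack(s): {data:.0f} TB processed in {hours:.1f} hours")
```

In this model the completion time stays constant while the data processed grows with the number of racks. A real system only approximates this, since shuffles, stragglers, and network contention add overhead that the sketch ignores.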
  • 10. Section 1.2 What’s New with Clouds?
  • 11. The Term ‘In the Cloud’ is Annoying
      • “Personally, I find the term ‘in the cloud’ pretentious and annoying. … the world’s marketers and P.R. people seem to think that ‘the cloud’ just means ‘online.’ ” David Pogue, NYT, June 16, 2011.
      • More specifically he notes that you can think of the cloud as “data and application software stored on remote servers [and accessed via the Internet]”
  • 12. Utility Clouds. Infrastructure as a Service (IaaS). Amazon Data Center
  • 13. Data Clouds. Large Data Cloud Services; ad targeting. Yahoo Data Center
  • 14. Virtualization. [Diagram labels: App, OS, Hypervisor, Computer. One stack shows apps on a single OS on a computer; the other shows apps on several OSes running on a hypervisor on a computer.]
  • 15. Idea Dates Back to the 1960s. [Diagram labels: App, App, App; CMS, MVS, CMS; IBM VM/370; IBM Mainframe.] Native (Full) Virtualization. Examples: VMware ESX
      • Virtualization first widely deployed with IBM VM/370.
  • 16. Scale is New
  • 17. Usage Based Pricing Is New. 1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.
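The arithmetic behind slide 17 can be written out as a short sketch. It assumes a flat price per machine-hour with no minimum charge or volume discount. The rate of 10 cents per machine-hour and the function name are hypothetical, not taken from the slides or from any provider's price list.

```python
def usage_cost_cents(machines: int, hours: int,
                     cents_per_machine_hour: int) -> int:
    """Cost under flat usage-based pricing: pay only for machine-hours used."""
    return machines * hours * cents_per_machine_hour


if __name__ == "__main__":
    RATE_CENTS = 10  # hypothetical price per machine-hour

    one_machine = usage_cost_cents(machines=1, hours=120,
                                   cents_per_machine_hour=RATE_CENTS)
    many_machines = usage_cost_cents(machines=120, hours=1,
                                     cents_per_machine_hour=RATE_CENTS)

    # Both runs consume 120 machine-hours, so the bill is identical,
    # but the second run finishes in 1 hour instead of 120.
    print(one_machine, many_machines)
```

Because cost depends only on the product of machines and hours, a job that parallelizes well can be finished far sooner at the same price.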
  • 18. Simplicity is New. ... and you have a computer ready to work. Elastic, on demand provisioning. A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.
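To give a sense of why the MapReduce model in slide 18 is quick to learn, here is a minimal single-process sketch of a word count. It is not the Hadoop API and is not taken from the tutorial: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are illustrative, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return (key, sum(values))


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "data clouds", "big data clouds"]))
```

The programmer writes only the map and reduce functions; in a framework such as Hadoop, the grouping step and the distribution of work are handled by the system.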
  • 19. Section 1.4 Utility Clouds
  • 20. Customer’s Responsibility / Cloud Service Provider’s Responsibility

      IaaS                   PaaS                   SaaS
      Apps                   Apps                   Apps
      Frameworks             Frameworks             Frameworks
      VM                     VM                     VM
      Hypervisor, network    Hypervisor, network    Hypervisor, network
  • 21. Amazon Style Data Cloud. [Diagram labels: Load Balancer; Simple Queue Service; SDB; EC2 Instances; S3 Storage Services.]
  • 22. NIST Definition
      • Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.
  • 23. NIST Definition
      Essential Characteristics
      • On-demand / self-service
      • Broad network access
      • Resource pooling
      • Rapid elasticity
      • Measured service
      Deployment Models
      • Private
      • Community
      • Public
      • Hybrid
      Service Models
      • Software as a Service (SaaS): consumer runs provider’s applications on cloud infrastructure
      • Platform as a Service (PaaS): consumer runs consumer-created applications on the cloud using tools supported by provider
      • Infrastructure as a Service (IaaS): consumer uses provider’s processing, storage, and networks
  • 24. Section 1.5 Data Clouds
  • 25. Google’s Large Data Cloud (Google’s Stack)
      • Applications
      • Compute Services: Google’s MapReduce
      • Data Services: Google’s BigTable
      • Storage Services: Google File System (GFS)
  • 26. Hadoop’s Large Data Cloud (Hadoop’s Stack)
      • Applications
      • Compute Services: Hadoop’s MapReduce
      • Data Services: NoSQL Databases
      • Storage Services: Hadoop Distributed File System (HDFS)
  • 27. Questions?