
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)


This is Chapter 1 of a tutorial that I gave at SC 11 on November 14, 2011.

  1. An Introduction to Data Intensive Computing
     Chapter 1: Introduction
     Robert Grossman, University of Chicago, Open Data Group
     Collin Bennett, Open Data Group
     November 14, 2011
  2. 1. Introduction (0830-0900)
        a. Data clouds (e.g. Hadoop)
        b. Utility clouds (e.g. Amazon)
     2. Managing Big Data (0900-0945)
        a. Databases
        b. Distributed File Systems (e.g. Hadoop)
        c. NoSQL databases (e.g. HBase)
     3. Processing Big Data (0945-1000 and 1030-1100)
        a. Multiple Virtual Machines & Message Queues
        b. MapReduce
        c. Streams over distributed file systems
     4. Lab using Amazon’s Elastic MapReduce (1100-1200)
  3. Our perspective is to consider data intensive computing from the viewpoint of utility and data clouds. For the most current version of these notes, please see:
  4. Section 1.1 Data Intensive Science
     Two of the 14 high throughput sequencers at the Ontario Institute for Cancer Research (OICR).
  5. Moore’s law also applies to the instruments that are producing data. This is creating new paradigms: “data intensive science” and “data intensive computing.”
  6. Source: Lincoln Stein
  7. Data is Big If It is Measured in MW
     •  Data is big if you measure it in megawatts.
     •  As in, a good sweet spot for a data center is 15 MW.
     •  As in, Facebook’s leased data centers are typically between 2.5 MW and 6.0 MW.
     •  Facebook’s new Prineville data center is 30 MW.
     •  Google’s computing infrastructure uses 260 MW.
  8. Some Big Data Sciences

     Discipline          Duration     Size             # Devices
     HEP - LHC           10 years     15 PB/year*      One
     Astronomy - LSST    10 years     12 PB/year**     One
     Genomics - NGS      2-4 years    0.4 TB/genome    1000’s

     *At full capacity, the Large Hadron Collider (LHC), the world’s largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://-en.html
     **As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://-1004.html
  9. An algorithm and computing infrastructure is “big-data scalable” if adding a rack of data (and corresponding processors) does not increase the time required to complete the computation but increases the amount of data that can be processed.
     Add capacity with constant time (ACCT)
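The ACCT property can be illustrated with a toy model. This is only a sketch: it assumes every rack holds the same amount of data, processes its own data in parallel with the others, and incurs no coordination overhead. The constants `RACK_CAPACITY_TB` and `RACK_RATE_TB_PER_HOUR` and their values are hypothetical, not figures from the tutorial.

```python
# Toy model of "add capacity with constant time" (ACCT).
# Both constants are illustrative assumptions, not measured values.
RACK_CAPACITY_TB = 100.0        # assumed data held by one rack
RACK_RATE_TB_PER_HOUR = 50.0    # assumed processing rate of one rack


def data_processed_tb(num_racks: int) -> float:
    """Total data covered when each rack contributes its own data."""
    return num_racks * RACK_CAPACITY_TB


def completion_time_hours(num_racks: int) -> float:
    """Time to finish when all racks work on their own data in parallel."""
    total_rate = num_racks * RACK_RATE_TB_PER_HOUR
    return data_processed_tb(num_racks) / total_rate


if __name__ == "__main__":
    for racks in (1, 2, 10):
        print(racks, data_processed_tb(racks), completion_time_hours(racks))
```

In this idealized model the completion time stays fixed while the data processed grows linearly with the number of racks. A real system that adds communication or skew costs per rack would fall short of ACCT.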
  10. Section 1.2 What’s New with Clouds?
  11. The Term ‘In the Cloud’ is Annoying
      •  “Personally, I find the term ‘in the cloud’ pretentious and annoying. … the world’s marketers and P.R. people seem to think that ‘the cloud’ just means ‘online.’ ” David Pogue, NYT June 16, 2011.
      •  More specifically he notes that you can think of the cloud as “data and application software stored on remote servers [and accessed via the Internet]”
  12. Utility Clouds
      Infrastructure as a Service (IaaS)
      Amazon Data Center
  13. Data Clouds
      Large Data Cloud Services
      ad targeting
      Yahoo Data Center
  14. Virtualization
      Without virtualization: App, App, App / OS / Computer
      With virtualization: App, App, App / OS, OS, OS / Hypervisor / Computer
  15. Idea Dates Back to the 1960s
      App, App, App / CMS, MVS, CMS / IBM VM/370 / IBM Mainframe
      Native (Full) Virtualization. Examples: VMware ESX
      •  Virtualization first widely deployed with IBM VM/370.
  16. Scale is New
  17. Usage Based Pricing Is New
      1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.
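The pricing claim is simple arithmetic: under purely usage-based pricing, cost depends only on the number of computer-hours consumed. A minimal sketch follows, assuming a flat per-computer-hour rate with no minimum charges or volume discounts; the rate `RATE_CENTS_PER_COMPUTER_HOUR` is a hypothetical value, not a real provider's price.

```python
# Usage-based pricing: cost is proportional to computer-hours consumed.
# The rate is a made-up illustration, kept in integer cents to avoid
# floating-point rounding.
RATE_CENTS_PER_COMPUTER_HOUR = 10


def usage_cost_cents(num_computers: int, hours: int) -> int:
    """Cost of running num_computers for the given number of hours."""
    return num_computers * hours * RATE_CENTS_PER_COMPUTER_HOUR


if __name__ == "__main__":
    one_for_120_hours = usage_cost_cents(1, 120)
    many_for_1_hour = usage_cost_cents(120, 1)
    print(one_for_120_hours, many_for_1_hour)
```

Both calls consume 120 computer-hours, so they cost the same; the second finishes in one hour instead of five days.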
  18. Simplicity is New
      + .. and you have a computer ready to work.
      Elastic, on demand provisioning.
      A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.
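To give a sense of why the MapReduce programming model is quick to learn, here is a minimal single-process sketch of its three steps (map, shuffle, reduce) applied to word counting. It uses plain Python rather than Hadoop or any other framework's API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are illustrative choices, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clouds"]))
```

The programmer writes only the map and reduce functions; grouping by key and distributing the work are the framework's job.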
  19. Section 1.4 Utility Clouds
  20. Customer’s Responsibility vs. Cloud Service Provider’s Responsibility

      IaaS                   PaaS                   SaaS
      Apps                   Apps                   Apps
      Frameworks             Frameworks             Frameworks
      VM                     VM                     VM
      Hypervisor, network    Hypervisor, network    Hypervisor, network
  21. Amazon Style Data Cloud
      Load Balancer
      Simple Queue Service
      SDB
      EC2 Instances
      S3 Storage Services
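The pattern behind this architecture, several workers pulling tasks from a shared queue and writing results to shared storage, can be sketched locally. This is an analogy only and makes no AWS calls: `queue.Queue` stands in for the Simple Queue Service, threads stand in for EC2 instances, and a dictionary stands in for S3. The function `run_workers` and the squaring "work" are hypothetical.

```python
import queue
import threading
from typing import Dict, Iterable


def run_workers(tasks: Iterable[int], num_workers: int = 3) -> Dict[int, int]:
    """Process tasks with a pool of workers that share one queue."""
    task_queue: "queue.Queue" = queue.Queue()   # stand-in for a message queue
    results: Dict[int, int] = {}                # stand-in for a storage service
    lock = threading.Lock()

    def worker() -> None:
        while True:
            item = task_queue.get()
            if item is None:                    # sentinel: no more work
                task_queue.task_done()
                break
            with lock:
                results[item] = item * item     # placeholder computation
            task_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for task in tasks:
        task_queue.put(task)
    for _ in threads:
        task_queue.put(None)                    # one sentinel per worker
    task_queue.join()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_workers(range(5)))
```

Because workers only interact through the queue and the store, the number of workers can be raised or lowered without changing the producer, which is the property that makes elastic provisioning useful.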
  22. NIST Definition
      •  Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.
  23. NIST Definition
      Essential Characteristics
      •  On-demand / self-service
      •  Broad network access
      •  Resource pooling
      •  Rapid elasticity
      •  Measured service
      Deployment Models
      •  Private
      •  Community
      •  Public
      •  Hybrid
      Service Models
      •  Software as a Service (SaaS) – consumer runs provider’s applications on cloud infrastructure
      •  Platform as a Service (PaaS) – consumer runs consumer-created applications on the cloud using tools supported by provider
      •  Infrastructure as a Service (IaaS) – consumer uses provider’s processing, storage, and networks
  24. Section 1.5 Data Clouds
  25. Google’s Large Data Cloud (Google’s Stack)
      Applications
      Compute Services: Google’s MapReduce
      Data Services: Google’s BigTable
      Storage Services: Google File System (GFS)
  26. Hadoop’s Large Data Cloud (Hadoop’s Stack)
      Applications
      Compute Services: Hadoop’s MapReduce
      Data Services: NoSQL Databases
      Storage Services: Hadoop Distributed File System (HDFS)
  27. Questions?