An Introduction to Data Intensive Computing

Chapter 1: Introduction

Robert Grossman
University of Chicago
Open Data Group

Collin Bennett
Open Data Group

November 14, 2011
1. Introduction (0830-0900)
   a. Data clouds (e.g. Hadoop)
   b. Utility clouds (e.g. Amazon)
2. Managing Big Data (0900-0945)
   a. Databases
   b. Distributed File Systems (e.g. Hadoop)
   c. NoSQL databases (e.g. HBase)
3. Processing Big Data (0945-1000 and 1030-1100)
   a. Multiple Virtual Machines & Message Queues
   b. MapReduce
   c. Streams over distributed file systems
4. Lab using Amazon's Elastic MapReduce (1100-1200)
  
For the most current version of these notes, please see: rgrossman.com

Our perspective is to consider data intensive computing from the viewpoint of utility and data clouds.
Section 1.1
Data Intensive Science

Two of the 14 high throughput sequencers at the Ontario Institute for Cancer Research (OICR).
Moore's law also applies to the instruments that are producing data.

This is creating new paradigms: "data intensive science" and "data intensive computing."

Source: Lincoln Stein
Data is Big If It is Measured in MW

• Data is big if you measure it in Megawatts.
• As in, a good sweet spot for a data center is 15 MW.
• As in, Facebook's leased data centers are typically between 2.5 MW and 6.0 MW.
• Facebook's new Prineville data center is 30 MW.
• Google's computing infrastructure uses 260 MW.
Some Big Data Sciences

Discipline        Duration   Size            # Devices
HEP - LHC         10 years   15 PB/year*     One
Astronomy - LSST  10 years   12 PB/year**    One
Genomics - NGS    2-4 years  0.4 TB/genome   1000's

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html

**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
An algorithm and computing infrastructure is "big-data scalable" if adding a rack of data (and corresponding processors) does not increase the time required to complete the computation but increases the amount of data that can be processed.

Add capacity with constant time (ACCT)
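A toy model can make the ACCT definition concrete: each added rack brings both data and processors, so aggregate throughput grows in step with the data and the wall-clock time stays flat. The rack size and per-rack processing rate below are made-up illustrative numbers, not measurements:

```python
# Toy model of "add capacity with constant time" (ACCT).
# Assumption: each rack adds both data and a matching share of processors,
# so per-rack throughput is constant and completion time does not grow.

def completion_time(racks, tb_per_rack=100, tb_per_hour_per_rack=10):
    """Hours to process all data when every rack processes its own share."""
    data = racks * tb_per_rack                  # total data grows with racks
    throughput = racks * tb_per_hour_per_rack   # aggregate throughput grows too
    return data / throughput                    # so time stays constant

for racks in (1, 2, 8):
    print(racks, "racks:", completion_time(racks), "hours")
```

The point of the sketch is the cancellation in the last line: doubling the racks doubles both numerator and denominator, which is exactly the ACCT property.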
  
Section 1.2
What's New with Clouds?
  
The Term 'In the Cloud' is Annoying

• "Personally, I find the term 'in the cloud' pretentious and annoying. … the world's marketers and P.R. people seem to think that 'the cloud' just means 'online.'" David Pogue, NYT, June 16, 2011.
• More specifically, he notes that you can think of the cloud as "data and application software stored on remote servers [and accessed via the Internet]."
  
Utility Clouds

Infrastructure as a Service (IaaS)

[Photo: Amazon Data Center]
  
Data Clouds

Large Data Cloud Services (e.g. ad targeting)

[Photo: Yahoo Data Center]
  
Virtualization

[Diagram: on the left, a single App/OS pair running directly on a Computer; on the right, several App/OS pairs sharing one Computer through a Hypervisor.]
  
Idea Dates Back to the 1960s

• Virtualization was first widely deployed with IBM VM/370.

[Diagram: CMS, MVS, and CMS guest systems, each running an App, hosted on IBM VM/370 on an IBM Mainframe.]

Native (Full) Virtualization
Examples: VMware ESX
  
Scale is New
  
Usage Based Pricing Is New

1 computer in a rack for 120 hours
costs the same as
120 computers in three racks for 1 hour
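The pricing comparison above is simple arithmetic: under usage-based pricing, cost is instances × hours × hourly rate, so the two configurations cost exactly the same. The $0.10/instance-hour rate below is a made-up example, not an actual Amazon price:

```python
# Usage-based pricing: cost depends only on total instance-hours consumed.
RATE = 0.10  # $ per instance-hour (hypothetical rate for illustration)

def cost(instances, hours, rate=RATE):
    return instances * hours * rate

one_machine_long = cost(1, 120)     # 1 computer for 120 hours
many_machines_short = cost(120, 1)  # 120 computers for 1 hour
print(one_machine_long, many_machines_short)
```

What the cloud changes is not the bill but the wall-clock time: for the same money, the answer arrives 120 times sooner.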
  
Simplicity is New

… and you have a computer ready to work.

• A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.
• Elastic, on-demand provisioning.
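Part of why a day of training suffices is that MapReduce asks the programmer for only two functions: a mapper that emits key/value pairs and a reducer that combines the values for each key; the framework handles distribution, grouping, and fault tolerance. A minimal single-machine sketch of the classic word-count example (illustrative only, not Hadoop's actual API):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit a (word, 1) pair for each word in one input record."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Sum all the counts emitted for one word."""
    return (word, sum(counts))

def mapreduce(lines):
    # Map phase: apply the mapper to every input record.
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle phase: group pairs by key (the framework does this for you).
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    return [reducer(key, (v for _, v in group))
            for key, group in groupby(pairs, key=itemgetter(0))]

print(mapreduce(["the quick fox", "the lazy dog"]))
```

On a real cluster the map and reduce calls run in parallel across many machines; the programmer's two functions are unchanged, which is the source of the simplicity.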
  
Section 1.4
Utility Clouds
  
IaaS, PaaS, SaaS

[Diagram: three stacks, each running Apps on Frameworks on a VM on a hypervisor and network. Moving from IaaS to PaaS to SaaS, the line dividing the customer's responsibility (upper layers) from the cloud service provider's responsibility (lower layers) moves up the stack.]
  
Amazon Style Data Cloud

[Diagram: a Load Balancer in front of two pools of EC2 Instances, connected to S3 Storage Services, the Simple Queue Service, and SDB.]
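The architecture above decouples the front-end instances from the workers through a message queue: producers enqueue tasks, and each worker instance pulls the next task when it is free. A single-process sketch of that producer/worker pattern using Python's standard library, with `queue.Queue` standing in for the Simple Queue Service and threads standing in for EC2 instances (an analogy, not SQS code):

```python
import queue
import threading

tasks = queue.Queue()    # plays the role of the Simple Queue Service
results = queue.Queue()

def worker():
    # Each worker is analogous to one EC2 instance pulling from the queue.
    while True:
        item = tasks.get()
        if item is None:          # sentinel value: shut this worker down
            break
        results.put(item * item)  # stand-in for real processing

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

for n in range(10):               # the "front end" enqueues work
    tasks.put(n)
for _ in threads:                 # one sentinel per worker
    tasks.put(None)
for t in threads:
    t.join()

print(sorted(results.queue))      # squares of 0..9, collected in any order
```

The design point carries over directly: because workers only ever pull from the queue, adding more instances raises throughput without any coordination among them.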
  
NIST Definition

• Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.
  
NIST Definition

Essential Characteristics
• On-demand self-service
• Broad network access
• Resource pooling
• Rapid elasticity
• Measured service

Service Models
• Software as a Service (SaaS) – consumer runs the provider's applications on cloud infrastructure
• Platform as a Service (PaaS) – consumer runs consumer-created applications on the cloud using tools supported by the provider
• Infrastructure as a Service (IaaS) – consumer uses the provider's processing, storage, and networks

Deployment Models
• Private
• Community
• Public
• Hybrid
  
Section 1.5
Data Clouds
  
Google's Large Data Cloud

Google's Stack (top to bottom):
• Applications
• Compute Services — Google's MapReduce
• Data Services — Google's BigTable
• Storage Services — Google File System (GFS)
  
Hadoop's Large Data Cloud

Hadoop's Stack (top to bottom):
• Applications
• Compute Services — Hadoop's MapReduce
• Data Services — NoSQL Databases
• Storage Services — Hadoop Distributed File System (HDFS)
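One way to drive the MapReduce layer of the Hadoop stack above without writing Java is Hadoop Streaming, where the mapper and reducer are ordinary programs that read records on stdin and write key/value lines on stdout. A minimal mapper sketch in that style (word count again; the exact streaming jar name and invocation in the note below vary by Hadoop version and installation):

```python
import sys

def map_stream(lines, out=sys.stdout):
    """Hadoop Streaming style mapper: emit one 'word<TAB>1' line per word.
    Hadoop sorts these lines by key before feeding them to the reducer."""
    for line in lines:
        for word in line.split():
            out.write(f"{word.lower()}\t1\n")

# In a real streaming job this would be: map_stream(sys.stdin)
map_stream(["The quick fox"])
```

A matching reducer would read the sorted `word<TAB>1` lines and sum runs of equal keys; both scripts are then passed to Hadoop with something like `hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input in -output out`.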
  
Questions?
