An Introduction to Data Intensive Computing

Chapter 2: Data Management

Robert Grossman
University of Chicago
Open Data Group

Collin Bennett
Open Data Group

November 14, 2011

1. Introduction (0830-0900)
   a. Data clouds (e.g. Hadoop)
   b. Utility clouds (e.g. Amazon)
2. Managing Big Data (0900-0945)
   a. Databases
   b. Distributed File Systems (e.g. Hadoop)
   c. NoSQL databases (e.g. HBase)
3. Processing Big Data (0945-1000 and 1030-1100)
   a. Multiple Virtual Machines & Message Queues
   b. MapReduce
   c. Streams over distributed file systems
4. Lab using Amazon's Elastic MapReduce (1100-1200)

  
What Are the Choices?
• Databases (SqlServer, Oracle, DB2)
• File Systems
• Distributed File Systems (Hadoop, Sector)
• Clustered File Systems (glusterfs, …)
• NoSQL Databases (HBase, Accumulo, Cassandra, SimpleDB, …)
• Applications (R, SAS, Excel, etc.)

What is the Fundamental Trade-Off?
Scale up vs. scale out …

2.1 Databases

Advice From Jim Gray
1. Analyzing big data requires scale-out solutions, not scale-up solutions (GrayWulf).
2. Move the analysis to the data.
3. Work with scientists to find the most common "20 queries" and make them fast.
4. Go from "working to working."

Pattern 1: Put the metadata in a database and point to files in a file system.
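A minimal sketch of Pattern 1 in Python, using SQLite purely for illustration: the queryable metadata lives in a database table, while the bulk data stays in the file system and is reached through a stored path. The table layout and file paths below are invented for the example and are not the schema of any system described in these notes.

import sqlite3

# Metadata goes into a small, indexable table; the large files stay on disk.
conn = sqlite3.connect("catalog.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS images (
        image_id   INTEGER PRIMARY KEY,
        ra         REAL,      -- metadata columns used in queries
        dec        REAL,
        band       TEXT,
        file_path  TEXT       -- pointer to the file in the file system
    )
""")
conn.execute(
    "INSERT INTO images (ra, dec, band, file_path) VALUES (?, ?, ?, ?)",
    (185.0, 15.8, "r", "/data/survey/run42/field007_r.fits"),
)
conn.commit()

# Queries touch only the metadata table; matching files are opened on demand.
for ra, dec, path in conn.execute(
        "SELECT ra, dec, file_path FROM images WHERE band = 'r'"):
    print(ra, dec, path)
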
  	
  
Example: Sloan Digital Sky Survey
• Two surveys in one
  – Photometric survey in 5 bands
  – Spectroscopic redshift survey
• Data is public
  – 40 TB of raw data
  – 5 TB of processed catalogs
  – 2.5 Terapixels of images
• Catalog uses Microsoft SQLServer
• Started in 1992, finished in 2008
• JHU SkyServer serves millions of queries

Example: Bionimbus Genomics Cloud
www.bionimbus.org

[Architecture diagram: a GWT-based front end over database services, analysis pipelines & re-analysis services, data ingestion services, data cloud services, utility cloud services, and intercloud services.]

[The same architecture annotated with technologies: database services (PostgreSQL), an ID service, analysis pipelines & re-analysis services, data ingestion services, large data cloud services (Hadoop, Sector/Sphere), elastic cloud services (Eucalyptus, OpenStack), and intercloud services (UDT, replication), behind the GWT-based front end.]

Section 2.2
Distributed File Systems
Sector/Sphere

Hadoop's Large Data Cloud
[Hadoop's stack, from top to bottom:
• Applications
• Compute Services: Hadoop's MapReduce
• Data Services: NoSQL Databases
• Storage Services: Hadoop Distributed File System (HDFS)]

Pattern 2: Put the data into a distributed file system.

Hadoop Design
• Designed to run over commodity components that fail.
• Data is replicated, typically three times.
• Block-based storage.
• Optimized for efficient scans with high throughput, not low-latency access.
• Designed for write once, read many.
• Append operation planned for the future.

Hadoop Distributed File System (HDFS) Architecture
[Diagram: a Name Node manages Data Nodes spread across racks; the client exchanges control messages with the Name Node and moves data directly to and from the Data Nodes.]
• HDFS is block-based.
• Written in Java.
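As a concrete, if minimal, illustration of Pattern 2 from the client side, the sketch below copies a local file into HDFS and reads it back. It assumes the cluster's WebHDFS gateway is enabled and uses the third-party Python package hdfs (HdfsCLI); the namenode address, user, and paths are placeholders.

from hdfs import InsecureClient

# WebHDFS endpoint of the name node; the address and user are assumptions.
client = InsecureClient("http://namenode:9870", user="hadoop")

# Write once: upload a local file into the distributed file system.
client.upload("/data/dataset1.txt", "dataset1.txt", overwrite=True)

# Read many: HDFS serves the blocks from whichever data nodes hold replicas.
with client.read("/data/dataset1.txt") as reader:
    print(reader.read()[:80])

# Directory listings go through the name node.
print(client.list("/data"))
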
  
Sector Distributed File System (SDFS) Architecture
• Broadly similar to the Google File System and the Hadoop Distributed File System.
• Uses the native file system; it is not block based.
• Has a security server that provides authorizations.
• Has multiple master name servers so that there is no single point of failure.
• Uses UDT to support wide area operations.

Sector Distributed File System (SDFS) Architecture
[Diagram: a Security Server and one or more Master Nodes control Slave Nodes spread across racks; the client exchanges control messages with a Master Node and data directly with the Slave Nodes.]
• SDFS is file-based (in contrast to HDFS, which is block-based).
• Written in C++.
• Security server.
• Multiple masters.

GlusterFS Architecture
• No metadata server.
• No single point of failure.
• Uses algorithms to determine the location of data.
• Can scale out by adding more bricks.

GlusterFS Architecture
[Diagram: the client exchanges data directly with GlusterFS server bricks spread across racks.]
• File-based.

Section 2.3
NoSQL Databases

Evolution
• Standard architecture for simple web applications:
  – Presentation: front-end, load-balanced web servers
  – Business logic layer
  – Backend database
• The database layer does not scale with large numbers of users or large amounts of data
• Alternatives arose
  – Sharded (partitioned) databases or master-slave dbs
  – memcache

Scaling RDBMSs
• Master-slave database systems
  – Writes go to the master
  – Reads go to the slaves
  – Writing to the slaves can be a bottleneck; slaves can be inconsistent
• Sharded databases
  – Applications and queries must understand the sharding schema
  – Both reads and writes scale
  – No native, direct support for joins across shards
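A minimal sketch of the sharding idea above: the application hashes a key to pick a shard, so both reads and writes scale across databases, but any query that spans shards has to be stitched together in application code. The shard names and the hashing choice are illustrative only.

import hashlib

SHARDS = ["users_db_0", "users_db_1", "users_db_2", "users_db_3"]

def shard_for(user_id: str) -> str:
    """Route a key to a shard by hashing it; the application owns this logic."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Reads and writes for one user always go to the same shard ...
print(shard_for("alice"))
# ... but a join across users on different shards must be assembled by the
# application, since there is no native cross-shard join.
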
  
NoSQL Systems
• The name suggests no SQL support, but is also read as "Not Only SQL"
• One or more of the ACID properties are not supported
• Joins are generally not supported
• Usually flexible schemas
• Some well-known examples: Google's BigTable, Amazon's Dynamo & Facebook's Cassandra
• Quite a few recent open source systems

Pattern 3: Put the data into a NoSQL application.

CAP – Choose Two Per Operation
[Diagram: a triangle with vertices C (Consistency), A (Availability), and P (Partition-resiliency).]
• CA: available and consistent, unless there is a partition.
• AP (e.g. Dynamo, Cassandra): a reachable replica provides service even in a partition, but may be inconsistent.
• CP (e.g. BigTable, HBase): always consistent, even in a partition, but a reachable replica may deny service without quorum.

CAP Theorem
• Proposed by Eric Brewer, 2000
• Three properties of a shared-data system: consistency, availability and partitions (partition tolerance)
• You can have at most two of these three properties for any shared-data system
• Scale out requires partitions
• Most large web-based systems choose availability over consistency
Reference: Brewer, PODC 2000; Gilbert/Lynch, SIGACT News 2002

  
Eventual Consistency
• If no updates occur for a while, all updates eventually propagate through the system and all the nodes will be consistent
• Eventually, a node is either updated or removed from service
• Can be implemented with a gossip protocol
• Amazon's Dynamo popularized this approach
• Sometimes this is called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID

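The toy simulation below illustrates the gossip idea behind eventual consistency: a write lands on one replica, and in each round every node exchanges its latest versioned value with one random peer until all replicas agree. It is a self-contained sketch, not the protocol of Dynamo or any particular system.

import random

random.seed(0)
NODES = 10
state = [(0, "old")] * NODES          # (version, value) held by each replica

# A single update arrives at one replica only.
state[3] = (1, "new")

rounds = 0
while len(set(state)) > 1:            # loop until every replica agrees
    rounds += 1
    for i in range(NODES):
        j = random.randrange(NODES)        # gossip with one random peer
        newest = max(state[i], state[j])   # keep the higher version
        state[i] = state[j] = newest

print(f"all {NODES} replicas consistent after {rounds} gossip rounds")
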
Different Types of NoSQL Systems
• Distributed Key-Value Systems
  – Amazon's S3 Key-Value Store (Dynamo)
  – Voldemort
  – Cassandra
• Column-based Systems
  – BigTable
  – HBase
  – Cassandra
• Document-based Systems
  – CouchDB

HBase Architecture
[Diagram: many clients connect, through a REST API or a Java client, to the HBaseMaster and to a set of HRegionServers, each backed by its own disk.]
Source: Raghu Ramakrishnan

HRegionServer
• Records are partitioned by column family into HStores
  – Each HStore contains many MapFiles
• All writes to an HStore are applied to a single memcache
• Reads consult the MapFiles and the memcache
• Memcaches are flushed as MapFiles (HDFS files) when full
• Compactions limit the number of MapFiles
[Diagram: within an HRegionServer, writes go to the HStore's memcache, reads consult the MapFiles and the memcache, and the memcache is flushed to disk as MapFiles.]
Source: Raghu Ramakrishnan

Facebook's Cassandra
• Modeled after BigTable's data model
• Modeled after Dynamo's eventual consistency
• Peer-to-peer storage architecture using consistent hashing (Chord hashing)

Databases vs. NoSQL Systems
• Scalability: databases scale to 100's of TB; NoSQL systems scale to 100's of PB.
• Functionality: databases support full SQL-based queries, including joins; NoSQL systems provide optimized access to sorted tables (tables with single keys).
• Optimization: databases are optimized for safe writes; data clouds are optimized for efficient reads.
• Consistency model: databases provide ACID (Atomicity, Consistency, Isolation & Durability), so the database is always consistent; NoSQL systems provide eventual consistency, where updates eventually propagate through the system.
• Parallelism: difficult for databases because of the ACID model, though shared-nothing designs are possible; the basic design of NoSQL systems incorporates parallelism over commodity components.
• Scale: databases scale to racks; NoSQL systems scale to the data center.

Section 2.3
Case Study: Project Matsu

Zoom Levels / Bounds
• Zoom Level 1: 4 images
• Zoom Level 2: 16 images
• Zoom Level 3: 64 images
• Zoom Level 4: 256 images
Source: Andrew Levine

Build Tile Cache in the Cloud - Mapper
• Step 1: Input to Mapper. Key: a bounding box, e.g. (minx = -135.0 miny = 45.0 maxx = -112.5 maxy = 67.5); value: the image.
• Step 2: Processing in Mapper. The mapper resizes and/or cuts up the original image into pieces to output bounding boxes.
• Step 3: Mapper Output. One record per piece, with the piece's bounding box as the key and the image tile as the value.
Source: Andrew Levine
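A rough Hadoop Streaming style sketch of the mapper step is shown below. It is illustrative only: the real Project Matsu mapper and its input format are not given in these notes. Here each input line is assumed to carry a bounding box and a local image path, and the mapper emits one (child bounding box, base64-encoded tile) record per quadrant, using the Pillow imaging library to cut the image.

import base64
import io
import sys

from PIL import Image  # third-party Pillow package

def split_bbox(minx, miny, maxx, maxy):
    """Split a bounding box into its four children, one zoom level down."""
    midx, midy = (minx + maxx) / 2.0, (miny + maxy) / 2.0
    return [(minx, midy, midx, maxy), (midx, midy, maxx, maxy),
            (minx, miny, midx, midy), (midx, miny, maxx, midy)]

def emit(bbox, tile):
    """Write one streaming record: bbox key, tab, base64-encoded PNG tile."""
    buf = io.BytesIO()
    tile.save(buf, format="PNG")
    key = ",".join(str(v) for v in bbox)
    print(key + "\t" + base64.b64encode(buf.getvalue()).decode("ascii"))

for line in sys.stdin:
    # Assumed input format: "minx,miny,maxx,maxy<TAB>/path/to/image"
    bbox_text, path = line.rstrip("\n").split("\t")
    minx, miny, maxx, maxy = (float(v) for v in bbox_text.split(","))
    img = Image.open(path)
    w, h = img.size
    # Cut the image into quadrants; geographic north is the top of the image.
    quads = [img.crop((0, 0, w // 2, h // 2)), img.crop((w // 2, 0, w, h // 2)),
             img.crop((0, h // 2, w // 2, h)), img.crop((w // 2, h // 2, w, h))]
    for child_bbox, tile in zip(split_bbox(minx, miny, maxx, maxy), quads):
        emit(child_bbox, tile)
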
  
Build Tile Cache in the Cloud - Reducer
• Step 1: Input to Reducer. Key: a bounding box, e.g. (minx = -45.0 miny = -2.8125 maxx = -43.59375 maxy = -2.109375); values: the image tiles that fall within that box.
• Step 2: Reducer Output. Assemble the images based on the bounding box.
• Output goes to HBase.
• Builds up layers for WMS for various datasets.
Source: Andrew Levine
  
HBase Tables
• An Open Geospatial Consortium (OGC) Web Mapping Service (WMS) query translates to the HBase scheme
  – Layers, Styles, Projection, Size
• Table name: WMS layer
  – Row ID: bounding box of the image
  – Column Family: style name and projection
  – Column Qualifier: width x height
  – Value: buffered image
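A minimal sketch of writing and reading one tile under the scheme above, using the third-party happybase client, which talks to HBase through its Thrift gateway. The host, table, style, projection, and size values are placeholders, and the table with its column family is assumed to have been created beforehand, for example from the HBase shell.

import happybase

connection = happybase.Connection("hbase-thrift-host")  # placeholder host
table = connection.table("wms_layer")                   # one table per WMS layer

row_id = b"-135.0,45.0,-112.5,67.5"   # row ID: bounding box of the image
column = b"default_epsg4326:512x512"  # family = style + projection, qualifier = width x height

# Store the buffered image bytes as the cell value.
with open("tile.png", "rb") as f:
    table.put(row_id, {column: f.read()})

# A WMS GetMap request for that layer, style, bounding box, and size then
# becomes a single row lookup.
image_bytes = table.row(row_id)[column]
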
  
Section 2.4
Distributed Key-Value Stores
S3

Pattern 4: Put the data into a distributed key-value store.

S3 Buckets
• S3 bucket names must be unique across AWS.
• A good practice is to use a pattern like
    tutorial.osdc.org/dataset1.txt
  for a domain you own.
• The file is then referenced as
    tutorial.osdc.org.s3.amazonaws.com/dataset1.txt
• If you own osdc.org you can create a DNS CNAME entry to access the file as
    tutorial.osdc.org/dataset1.txt

S3 Keys
• Keys must be unique within a bucket.
• Values can be as large as 5 TB (formerly 5 GB).

S3 Security
• AWS access key (user name)
  – This functions as your S3 username. It is an alphanumeric text string that uniquely identifies users.
• AWS secret key (functions as a password)
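A minimal sketch of Pattern 4 against S3, using the boto3 Python SDK (which post-dates these 2011 notes). The bucket and key follow the naming pattern above and are placeholders; the credentials are the AWS access key and secret key described on this slide, picked up from the standard environment variables or configuration files rather than hard-coded.

import boto3

s3 = boto3.client("s3")               # credentials: access key + secret key

bucket = "tutorial.osdc.org"          # bucket names are unique across AWS
key = "dataset1.txt"                  # keys are unique within a bucket

# Put a value (an object) under the key, then read it back.
# The bucket is assumed to exist already.
s3.put_object(Bucket=bucket, Key=key, Body=b"example contents\n")
obj = s3.get_object(Bucket=bucket, Key=key)
print(obj["Body"].read().decode())
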
  
AWS Account Information
[Screenshot: the Access Keys section of the AWS account page, showing the access key (user name) and the secret key (password).]

Other Amazon Data Services
• Amazon Simple Database Service (SDS)
• Amazon's Elastic Block Storage (EBS)

Section 2.5
Moving Large Data Sets

The Basic Problem
• TCP was never designed to move large data sets over wide area high performance networks.
• As a general rule, reading data off disks is slower than transporting it over the network.

[Figure: TCP throughput (Mb/s) versus round trip time (1-400 ms) at packet loss rates of 0.01%, 0.05%, 0.1%, and 0.5%. Throughput drops sharply as RTT and loss grow, across LAN, US, US-EU, and US-Asia distances. Source: Yunhong Gu, 2007, experiments over wide area 1G.]
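The collapse shown in the figure can be sketched with the well-known Mathis et al. approximation for steady-state TCP throughput, roughly (MSS / RTT) * (1.22 / sqrt(loss)). The short calculation below uses assumed parameters and is only a back-of-the-envelope illustration; it is not the model behind the measured curves above.

from math import sqrt

MSS = 1460 * 8          # maximum segment size in bits (typical Ethernet payload)
C = 1.22                # constant in the Mathis approximation

def tcp_throughput_mbps(rtt_ms, loss):
    """Approximate steady-state TCP throughput in Mb/s for one stream."""
    rtt_s = rtt_ms / 1000.0
    return MSS * C / (rtt_s * sqrt(loss)) / 1e6

for rtt in (1, 10, 100, 200):   # LAN through intercontinental RTTs
    print(rtt, "ms:", round(tcp_throughput_mbps(rtt, 0.0001), 1), "Mb/s at 0.01% loss")
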
  
The Solution
• Use parallel TCP streams
  – GridFTP
• Use specialized network protocols
  – UDT, FAST, etc.
• Use RAID to stripe data across disks to improve throughput when reading
• These techniques are well understood in HEP and astronomy, but not yet in biology.
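To make the parallel-streams idea concrete, the toy sketch below downloads one file over HTTP on several concurrent connections, each fetching a different byte range. It only illustrates the principle that GridFTP exploits with parallel TCP streams; GridFTP itself is not an HTTP tool, and the URL is a placeholder.

import concurrent.futures

import requests  # third-party HTTP client

URL = "http://example.org/large-dataset.bin"   # placeholder
STREAMS = 4

def fetch_range(byte_range):
    """Fetch one byte range of the file on its own TCP connection."""
    start, end = byte_range
    headers = {"Range": f"bytes={start}-{end}"}
    return start, requests.get(URL, headers=headers, timeout=60).content

# Assumes the server reports Content-Length and honors Range requests.
size = int(requests.head(URL, timeout=60).headers["Content-Length"])
step = size // STREAMS
ranges = [(i * step, size - 1 if i == STREAMS - 1 else (i + 1) * step - 1)
          for i in range(STREAMS)]

with concurrent.futures.ThreadPoolExecutor(max_workers=STREAMS) as pool:
    parts = sorted(pool.map(fetch_range, ranges))   # (offset, bytes) pairs

with open("large-dataset.bin", "wb") as out:
    for _, chunk in parts:
        out.write(chunk)
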
  
Case Study: Bio-mirror

"[The open source GridFTP] from the Globus project has recently been improved to offer UDP-based file transport, with long-distance speed improvements of 3x to 10x over the usual TCP-based file transport."

-- Don Gilbert, August 2010, bio-mirror.net

Moving 113 GB of Bio-mirror Data

Site      RTT (ms)   TCP    UDT   TCP/UDT   Km
NCSA      10         139    139   1         200
Purdue    17         125    125   1         500
ORNL      25         361    120   3         1,200
TACC      37         616    120   5         2,000
SDSC      65         750    475   1.6       3,300
CSTNET    274        3722   304   12        12,000

GridFTP TCP and UDT transfer times for 113 GB from gridip.bio-mirror.net/biomirror/blast/ (Indiana, USA). All TCP and UDT times are in minutes. Source: http://gridip.bio-mirror.net/biomirror/

Case Study: CGI 60 Genomes
• Trace by Complete Genomics showing the performance of moving 60 complete human genomes from Mountain View to Chicago using the open source Sector/UDT.
• Approximately 18 TB at about 0.5 Mbs on a 1G link.
Source: Complete Genomics.

Resource Use

Protocol        CPU Usage*     Memory*
GridFTP (UDT)   1.0% - 3.0%    40 MB
GridFTP (TCP)   0.1% - 0.6%    6 MB

*CPU and memory usage collected by Don Gilbert. He reports that rsync uses more CPU than GridFTP with UDT. Source: http://gridip.bio-mirror.net/biomirror/.
  
Sector/Sphere
• Sector/Sphere is a platform for data intensive computing built over UDT and designed to support geographically distributed clusters.

Questions?
For the most current version of these notes, see rgrossman.com
