An Introduction to Data Intensive Computing

Chapter 2: Data Management

Robert Grossman
University of Chicago
Open Data Group

Collin Bennett
Open Data Group

November 14, 2011
  
1.  Introduction (0830-0900)
    a.  Data clouds (e.g. Hadoop)
    b.  Utility clouds (e.g. Amazon)
2.  Managing Big Data (0900-0945)
    a.  Databases
    b.  Distributed File Systems (e.g. Hadoop)
    c.  NoSQL databases (e.g. HBase)
3.  Processing Big Data (0945-1000 and 1030-1100)
    a.  Multiple Virtual Machines & Message Queues
    b.  MapReduce
    c.  Streams over distributed file systems
4.  Lab using Amazon’s Elastic Map Reduce (1100-1200)
  
What Are the Choices?
Databases (SqlServer, Oracle, DB2)
File Systems
Distributed File Systems (Hadoop, Sector)
Clustered File Systems (glusterfs, …)
NoSQL Databases (HBase, Accumulo, Cassandra, SimpleDB, …)
Applications (R, SAS, Excel, etc.)
  
What is the Fundamental Trade Off?
Scale up vs Scale out …
  
2.1 Databases
  
Advice From Jim Gray
1.  Analyzing big data requires scale-out solutions not scale-up solutions (GrayWulf)
2.  Move the analysis to the data.
3.  Work with scientists to find the most common “20 queries” and make them fast.
4.  Go from “working to working.”
  
Pattern 1: Put the metadata in a database and point to files in a file system.
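A minimal sketch of this pattern, assuming an in-memory SQLite database stands in for the metadata catalog. The table name, column names, and file paths are hypothetical and chosen only for illustration; the point is that queries run against small metadata rows while the large files stay in the file system.

```python
import sqlite3


def build_catalog(conn):
    """Create a metadata table whose rows point at files on disk."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        " id INTEGER PRIMARY KEY,"
        " band TEXT NOT NULL,"
        " ra_deg REAL NOT NULL,"
        " dec_deg REAL NOT NULL,"
        " path TEXT NOT NULL)"
    )


def register(conn, band, ra_deg, dec_deg, path):
    """Record metadata for one file; the file itself stays in the file system."""
    conn.execute(
        "INSERT INTO images (band, ra_deg, dec_deg, path) VALUES (?, ?, ?, ?)",
        (band, ra_deg, dec_deg, path),
    )


def find_paths(conn, band, ra_min, ra_max):
    """Answer a metadata query in SQL and return only the matching file paths."""
    rows = conn.execute(
        "SELECT path FROM images"
        " WHERE band = ? AND ra_deg BETWEEN ? AND ? ORDER BY path",
        (band, ra_min, ra_max),
    )
    return [row[0] for row in rows]


# Hypothetical catalog entries (paths are placeholders, not real files).
conn = sqlite3.connect(":memory:")
build_catalog(conn)
register(conn, "r", 10.5, -1.2, "/data/images/r/frame-0001.fits")
register(conn, "g", 10.6, -1.2, "/data/images/g/frame-0001.fits")
register(conn, "r", 200.0, 5.0, "/data/images/r/frame-0002.fits")
print(find_paths(conn, "r", 0.0, 90.0))
```

The application then opens the returned paths directly, so the database never has to store or stream the bulk data.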
  	
  
Example: Sloan Digital Sky Survey
•  Two surveys in one
   – Photometric survey in 5 bands
   – Spectroscopic redshift survey
•  Data is public
   – 40 TB of raw data
   – 5 TB processed catalogs
   – 2.5 Terapixels of images
•  Catalog uses Microsoft SQLServer
•  Started in 1992, finished in 2008
•  JHU SkyServer serves millions of queries
  
Example: Bionimbus Genomics Cloud
www.bionimbus.org
[Architecture diagram, shown twice; the second version names the technologies:
 GWT-based Front End
 Database Services (PostgreSQL)
 Analysis Pipelines & Re-analysis Services
 Large Data Cloud Services (Hadoop, Sector/Sphere)
 Data Ingestion Services
 Elastic Cloud Services (Eucalyptus, OpenStack)
 Intercloud Services (UDT, replication)
 ID Service]
  
Section 2.2
Distributed File Systems
  
Sector/Sphere

Hadoop’s Large Data Cloud
[Diagram: Hadoop’s Stack
 Storage Services: Hadoop Distributed File System (HDFS)
 Compute Services: Hadoop’s MapReduce
 Data Services: NoSQL Databases
 Applications]
  
Pattern 2: Put the data into a distributed file system.
  
Hadoop Design
•  Designed to run over commodity components that fail.
•  Data replicated, typically three times.
•  Block-based storage.
•  Optimized for efficient scans with high throughput, not low latency access.
•  Designed for write once, read many.
•  Append operation planned for future.
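The block-based, replicated layout described above can be sketched as follows. This is only an illustration under stated assumptions: the round-robin placement is a stand-in chosen for simplicity (it is not Hadoop's actual placement policy), and the function names, node names, and block size are hypothetical.

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes.

    Assumption: simple round-robin placement, used here only to show that
    every block lives on several machines so one failure loses no data.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


# A 200-byte "file" with 64-byte blocks, replicated three times on four nodes.
blocks = split_into_blocks(200, 64)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
print(blocks)
print(placement)
```

Because a scan reads whole blocks sequentially from whichever replica is closest, this layout favors throughput over low-latency access to individual records.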
  
Hadoop Distributed File System (HDFS) Architecture
[Diagram: a Client exchanges control messages with the Name Node and data with the Data Nodes; the Data Nodes are spread across three Racks.]
•  HDFS is block-based.
•  Written in Java.
  
Sector Distributed File System (SDFS) Architecture
•  Broadly similar to Google File System and Hadoop Distributed File System.
•  Uses native file system.  It is not block based.
•  Has security server that provides authorizations.
•  Has multiple master name servers so that there is no single point of failure.
•  Uses UDT to support wide area operations.
  
Sector Distributed File System (SDFS) Architecture
[Diagram: a Client exchanges control messages with the Master Nodes and data with the Slave Nodes; the Slave Nodes are spread across three Racks; a Security Server exchanges control messages with the Master Nodes.]
•  SDFS is file-based.
•  Written in C++.
•  Security server.
•  Multiple masters.
  
GlusterFS Architecture
•  No metadata server.
•  No single point of failure.
•  Uses algorithms to determine location of data.
•  Can scale out by adding more bricks.
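To illustrate how a location can be computed rather than looked up, here is a minimal sketch that hashes a file path to pick a brick. It is an assumption-laden toy, not GlusterFS's own algorithm: the function name, the use of MD5, and the brick names are all hypothetical.

```python
import hashlib


def brick_for(path, bricks):
    """Pick a brick from the file path alone, so no metadata server is consulted.

    Assumption: plain hash-modulo placement, used only to show the idea that
    every client can compute the same answer independently.
    """
    digest = hashlib.md5(path.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(bricks)
    return bricks[index]


bricks = ["brick-1", "brick-2", "brick-3"]
for path in ["/genomes/a.fastq", "/genomes/b.fastq", "/images/c.tif"]:
    print(path, "->", brick_for(path, bricks))
```

One known weakness of plain modulo placement is that adding a brick changes the answer for most paths, which is why production systems use more careful schemes.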
  
GlusterFS Architecture
[Diagram: a Client exchanges data directly with Bricks spread across three Racks. Labels: GlusterFS Server; File-based.]
  
Section 2.3
NoSQL Databases
  
Evolution
•  Standard architecture for simple web applications:
   – Presentation: front-end, load balanced web servers
   – Business logic layer
   – Backend database
•  Database layer does not scale with large numbers of users or large amounts of data
•  Alternatives arose
   – Sharded (partitioned) databases or master-slave dbs
   – memcache
  
Scaling RDBMS
•  Master-slave database systems
   – Writes to master
   – Reads from slaves
   – Can be bottlenecks writing to slaves; can be inconsistent
•  Sharded databases
   – Applications and queries must understand sharding schema
   – Both reads and writes scale
   – No native, direct support for joins across shards
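A minimal sketch of the sharded approach, assuming each shard is just an in-memory dictionary and that keys are routed by a hash. The class and method names are hypothetical; the sketch only shows why single-key reads and writes scale while anything spanning shards must be assembled by the application.

```python
import hashlib


class ShardedStore:
    """Route each key to one of several independent shards (plain dicts here)."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # The application must know this routing rule (the sharding schema).
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return self.shards[int.from_bytes(digest[:4], "big") % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value      # touches exactly one shard

    def get(self, key):
        return self._shard_for(key).get(key)   # touches exactly one shard

    def scan(self, predicate):
        """A cross-shard query has to visit every shard and merge the results."""
        return sorted(
            key
            for shard in self.shards
            for key, value in shard.items()
            if predicate(value)
        )


store = ShardedStore(num_shards=4)
for user, age in [("alice", 31), ("bob", 25), ("carol", 47)]:
    store.put(user, age)
print(store.get("bob"), store.scan(lambda age: age > 30))
```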
  
NoSQL Systems
•  Suggests No SQL support, also Not Only SQL
•  One or more of the ACID properties not supported
•  Joins generally not supported
•  Usually flexible schemas
•  Some well known examples: Google’s BigTable, Amazon’s Dynamo & Facebook’s Cassandra
•  Quite a few recent open source systems
  
Pattern 3: Put the data into a NoSQL application.
  
CAP: Choose Two Per Operation
[Diagram: triangle with vertices C (Consistency), A (Availability) and P (Partition-resiliency)]
CA: available and consistent, unless there is a partition.
AP: a reachable replica provides service even in a partition, but may be inconsistent. (Dynamo, Cassandra)
CP: always consistent, even in a partition, but a reachable replica may deny service without quorum. (BigTable, HBase)
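The difference between the AP and CP choices can be sketched with two read functions. This is a toy model under stated assumptions: replicas are represented as (version, value) pairs, writes are assumed to have reached a majority before being acknowledged, and all names are hypothetical.

```python
def quorum_size(num_replicas):
    """Smallest group that overlaps every other majority of the replicas."""
    return num_replicas // 2 + 1


def cp_read(reachable, num_replicas):
    """CP-style read: deny service unless a majority of replicas is reachable."""
    if len(reachable) < quorum_size(num_replicas):
        raise RuntimeError("no quorum: refusing to serve a possibly stale value")
    # Any majority contains at least one replica holding the latest majority write.
    return max(reachable, key=lambda replica: replica[0])[1]


def ap_read(reachable):
    """AP-style read: answer from any reachable replica, even if it is stale."""
    if not reachable:
        raise RuntimeError("no replica reachable")
    return reachable[0][1]


# Three replicas; one missed the latest write, then a partition isolates it.
all_replicas = [(2, "new"), (2, "new"), (1, "old")]
isolated_side = [(1, "old")]
print(cp_read(all_replicas, 3))   # majority reachable
print(ap_read(isolated_side))     # stays available, returns the stale value
```

On the isolated side, `cp_read` raises instead of answering, which is the "may deny service without quorum" behavior noted for CP systems.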
  
CAP Theorem
•  Proposed by Eric Brewer, 2000
•  Three properties of a system: consistency, availability and partitions
•  You can have at most two of these three properties for any shared-data system
•  Scale out requires partitions
•  Most large web-based systems choose availability over consistency
Reference: Brewer, PODC 2000; Gilbert/Lynch, SIGACT News 2002
  
Eventual Consistency
•  If no updates occur for a while, all updates eventually propagate through the system and all the nodes will be consistent
•  Eventually, a node is either updated or removed from service.
•  Can be implemented with Gossip protocol
•  Amazon’s Dynamo popularized this approach
•  Sometimes this is called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID
  
Different Types of NoSQL Systems
•  Distributed Key-Value Systems
   –  Amazon’s S3 Key-Value Store (Dynamo)
   –  Voldemort
   –  Cassandra
•  Column-based Systems
   –  BigTable
   –  HBase
   –  Cassandra
•  Document-based systems
   –  CouchDB
  
HBase Architecture
[Diagram: Clients reach HBase through the REST API or the Java Client; an HBaseMaster sits alongside several HRegionServers, each backed by Disk.]
Source: Raghu Ramakrishnan
  
HRegion Server
•  Records partitioned by column family into HStores
   –  Each HStore contains many MapFiles
•  All writes to HStore applied to single memcache
•  Reads consult MapFiles and memcache
•  Memcaches flushed as MapFiles (HDFS files) when full
•  Compactions limit number of MapFiles
[Diagram: inside an HRegionServer, writes go to the HStore’s Memcache, which is flushed to disk as MapFiles; reads consult both.]
Source: Raghu Ramakrishnan
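The write and read path in the bullets above can be sketched as a small in-memory model. This is not HBase code: the class name, the limits, and the use of tuples to stand in for immutable MapFiles are assumptions made for illustration.

```python
class HStoreSketch:
    """Toy model: writes hit a memcache, which is flushed to immutable map files."""

    def __init__(self, memcache_limit=2, max_mapfiles=3):
        self.memcache = {}
        self.mapfiles = []              # oldest first; each is a sorted tuple of pairs
        self.memcache_limit = memcache_limit
        self.max_mapfiles = max_mapfiles

    def put(self, key, value):
        self.memcache[key] = value      # every write goes to the single memcache
        if len(self.memcache) >= self.memcache_limit:
            self.flush()

    def flush(self):
        """Write the full memcache out as a new sorted, immutable map file."""
        if self.memcache:
            self.mapfiles.append(tuple(sorted(self.memcache.items())))
            self.memcache = {}
        if len(self.mapfiles) > self.max_mapfiles:
            self.compact()

    def compact(self):
        """Merge all map files into one, letting newer values win."""
        merged = {}
        for mapfile in self.mapfiles:   # oldest first, so later updates overwrite
            merged.update(mapfile)
        self.mapfiles = [tuple(sorted(merged.items()))]

    def get(self, key):
        """Reads consult the memcache first, then map files from newest to oldest."""
        if key in self.memcache:
            return self.memcache[key]
        for mapfile in reversed(self.mapfiles):
            for stored_key, value in mapfile:
                if stored_key == key:
                    return value
        return None


store = HStoreSketch(memcache_limit=2, max_mapfiles=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5), ("e", 6)]:
    store.put(key, value)
print(store.get("a"), len(store.mapfiles))
```

Without compaction, every read would have to search a growing number of map files, which is why the number of MapFiles is kept bounded.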
  
Facebook’s Cassandra
•  Modeled after BigTable’s data model
•  Modeled after Dynamo’s eventual consistency
•  Peer to peer storage architecture using consistent hashing (Chord hashing)
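Consistent hashing can be sketched in a few lines. The version below is a minimal illustration, assuming one position per node on the ring and MD5 as the hash; the class name and node names are hypothetical, and this is not Cassandra's implementation (which also handles replication and token assignment).

```python
import bisect
import hashlib


def _position(text):
    """Map a string to a position on the ring (a 64-bit integer)."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Each key belongs to the first node clockwise from the key's position."""

    def __init__(self, nodes=()):
        self._points = []               # sorted ring positions
        self._owners = {}               # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        point = _position(node)
        bisect.insort(self._points, point)
        self._owners[point] = node

    def remove(self, node):
        point = _position(node)
        self._points.remove(point)
        del self._owners[point]

    def node_for(self, key):
        index = bisect.bisect_right(self._points, _position(key))
        return self._owners[self._points[index % len(self._points)]]


ring = HashRing(["node-A", "node-B", "node-C"])
keys = [f"row-{i}" for i in range(200)]
before = {key: ring.node_for(key) for key in keys}
ring.add("node-D")
moved = [key for key in keys if ring.node_for(key) != before[key]]
print(len(moved), "of", len(keys), "keys moved to the new node")
```

The design point is that adding a node only takes keys away from its neighbor on the ring; every other key stays where it was, so peers can join or leave without reshuffling the whole data set.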
  
                   | Databases | NoSQL Systems
Scalability        | 100’s TB | 100’s PB
Functionality      | Full SQL-based queries, including joins | Optimized access to sorted tables (tables with single keys)
Optimized          | Databases optimized for safe writes | Clouds optimized for efficient reads
Consistency model  | ACID (Atomicity, Consistency, Isolation & Durability): database always consistent | Eventual consistency: updates eventually propagate through system
Parallelism        | Difficult because of ACID model; shared nothing is possible | Basic design incorporates parallelism over commodity components
Scale              | Racks | Data center
  
Section 2.3
Case Study: Project Matsu
  
Zoom Levels / Bounds
Zoom Level 1: 4 images
Zoom Level 2: 16 images
Zoom Level 3: 64 images
Zoom Level 4: 256 images
Source: Andrew Levine
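The image counts above grow by a factor of four per zoom level. A short sketch of that arithmetic follows; the quadrant split is an assumption about how one tile could be divided into four at the next level, and the function names are hypothetical rather than taken from Project Matsu.

```python
def images_at_zoom(level):
    """Number of images at a zoom level when each level quadruples the count."""
    return 4 ** level


def split_bbox(minx, miny, maxx, maxy):
    """Split a bounding box into four equal quadrants (assumed tiling rule)."""
    midx = (minx + maxx) / 2
    midy = (miny + maxy) / 2
    return [
        (minx, miny, midx, midy),   # lower left
        (midx, miny, maxx, midy),   # lower right
        (minx, midy, midx, maxy),   # upper left
        (midx, midy, maxx, maxy),   # upper right
    ]


print([images_at_zoom(level) for level in range(1, 5)])
print(split_bbox(-180, -90, 180, 90))
```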
  
Build Tile Cache in the Cloud - Mapper
Step 1: Input to Mapper
   Mapper Input Key: Bounding Box (minx = -135.0 miny = 45.0 maxx = -112.5 maxy = 67.5)
   Mapper Input Value: (image)
Step 2: Processing in Mapper
   Mapper resizes and/or cuts up the original image into pieces to output Bounding Boxes
Step 3: Mapper Output
   Mapper Output Key: Bounding Box
   Mapper Output Value: (image)
   (one key/value pair per output tile)
Source: Andrew Levine
  
Build Tile Cache in the Cloud - Reducer
Step 1: Input to Reducer
   Reducer Key Input: Bounding Box (minx = -45.0 miny = -2.8125 maxx = -43.59375 maxy = -2.109375)
   Reducer Value Input: (images) …
Step 2: Reducer Output
   Assemble Images based on bounding box
•  Output to HBase
•  Builds up Layers for WMS for various datasets
Source: Andrew Levine
  
HBase Tables
•  Open Geospatial Consortium (OGC) Web Mapping Service (WMS) Query translates to HBase schema
   – Layers, Styles, Projection, Size
•  Table name: WMS Layer
   – Row ID: Bounding Box of image
     - Column Family: Style Name and Projection
       - Column Qualifier: Width x Height
         - Value: Buffered Image
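To make the nesting of this schema concrete, the sketch below models one table (one WMS layer) as a dictionary of rows, each holding a dictionary of columns. It uses no HBase client; the string layouts for the row ID and column name, the style and projection values, and the function names are all hypothetical.

```python
def row_id(minx, miny, maxx, maxy):
    """Row ID: the bounding box of the image (string layout is hypothetical)."""
    return f"{minx},{miny},{maxx},{maxy}"


def column(style, projection, width, height):
    """Column family 'style/projection' plus qualifier 'WIDTHxHEIGHT' (hypothetical)."""
    return f"{style}/{projection}:{width}x{height}"


def put_tile(table, bbox, style, projection, width, height, image_bytes):
    """Store one tile image under its row and column."""
    table.setdefault(row_id(*bbox), {})[column(style, projection, width, height)] = image_bytes


def get_tile(table, bbox, style, projection, width, height):
    """Look up a tile the way a translated WMS query would; None if absent."""
    return table.get(row_id(*bbox), {}).get(column(style, projection, width, height))


# One dictionary per table, i.e. per WMS layer.
layer = {}
bbox = (-135.0, 45.0, -112.5, 67.5)
put_tile(layer, bbox, "default", "EPSG:4326", 256, 256, b"...image bytes...")
print(get_tile(layer, bbox, "default", "EPSG:4326", 256, 256))
```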
  
Section 2.4
Distributed Key-Value Stores
S3
  
Pattern 4: Put the data into a distributed key-value store.
  
S3 Buckets
•  S3 bucket names must be unique across AWS
•  A good practice is to use a pattern like
      tutorial.osdc.org/dataset1.txt
   for a domain you own.
•  The file is then referenced as
      tutorial.osdc.org.s3.amazonaws.com/dataset1.txt
•  If you own osdc.org you can create a DNS CNAME entry to access the file as
      tutorial.osdc.org/dataset1.txt
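The naming pattern above amounts to simple string construction. The sketch below shows both forms of the address; the function names and the choice of the http scheme are assumptions for illustration, and no request is made to AWS.

```python
from urllib.parse import quote


def s3_url(bucket, key):
    """Address of an object when the bucket name is used as a host-name prefix."""
    return f"http://{bucket}.s3.amazonaws.com/{quote(key, safe='/')}"


def cname_url(bucket, key):
    """Address of the same object once a DNS CNAME maps the bucket name to S3."""
    return f"http://{bucket}/{quote(key, safe='/')}"


print(s3_url("tutorial.osdc.org", "dataset1.txt"))
print(cname_url("tutorial.osdc.org", "dataset1.txt"))
```

Naming the bucket after a host name you control is what makes the second, shorter form possible.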
  
S3 Keys
•  Keys must be unique within a bucket.
•  Values can be as large as 5 TB (formerly 5 GB)
  
S3 Security
•  AWS access key (user name)
•  This functions as your S3 username. It is an alphanumeric text string that uniquely identifies users.
•  AWS Secret key (functions as password)
[Screenshot: AWS Account Information, Access Keys, with the access key labeled User Name and the secret key labeled Password]
  
Other Amazon Data Services
•  Amazon Simple Database Service (SDS)
•  Amazon’s Elastic Block Storage (EBS)
  
Section 2.5
Moving Large Data Sets
  
The Basic Problem
•  TCP was never designed to move large data sets over wide area high performance networks.
•  As a general rule, reading data off disks is slower than transporting it over the network.
  
[Figure: TCP Throughput vs RTT and Packet Loss. Throughput (Mb/s, 200 to 1000) plotted against Round Trip Time (ms; 1, 10, 100, 200, 400) for packet loss rates between 0.01% and 0.5%; the RTT ranges are labeled LAN, US, US-EU and US-ASIA.]
Source: Yunhong Gu, 2007, experiments over wide area 1G.
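The shape of such a curve can be approximated with the well-known Mathis et al. model for steady-state TCP throughput, roughly (MSS / RTT) × sqrt(3/2) / sqrt(p). Using this model here is an assumption: the slide does not name it, and the measured values in the figure will differ. The function name, the 1460-byte segment size, and the 1000 Mb/s cap are illustrative choices.

```python
import math


def tcp_throughput_mbps(mss_bytes, rtt_ms, loss_rate, link_mbps=1000.0):
    """Approximate steady-state TCP throughput in Mb/s (Mathis et al. model).

    throughput ~ (MSS / RTT) * sqrt(3/2) / sqrt(loss_rate), capped at link speed.
    """
    rtt_s = rtt_ms / 1000.0
    bits_per_s = (mss_bytes * 8 / rtt_s) * math.sqrt(1.5) / math.sqrt(loss_rate)
    return min(bits_per_s / 1e6, link_mbps)


# Same loss rate (0.01%), increasing round trip time.
for rtt_ms in (1, 10, 100, 200, 400):
    print(rtt_ms, "ms:", round(tcp_throughput_mbps(1460, rtt_ms, 0.0001), 1), "Mb/s")
```

Under this model throughput falls in proportion to the round trip time, which is why a single TCP stream that fills a LAN link struggles on intercontinental paths.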
  
The Solution
•  Use parallel TCP streams
   – GridFTP
•  Use specialized network protocols
   – UDT, FAST, etc.
•  Use RAID to stripe data across disks to improve throughput when reading
•  These techniques are well understood in HEP, astronomy, but not yet in biology.
  
Case Study: Bio-mirror
[The open source GridFTP] from the Globus project has recently been improved to offer UDP-based file transport, with long-distance speed improvements of 3x to 10x over the usual TCP-based file transport.

-- Don Gilbert, August 2010, bio-mirror.net
  
Moving 113GB of Bio-mirror Data
Site    | RTT | TCP  | UDT | TCP/UDT | Km
NCSA    | 10  | 139  | 139 | 1       | 200
Purdue  | 17  | 125  | 125 | 1       | 500
ORNL    | 25  | 361  | 120 | 3       | 1,200
TACC    | 37  | 616  | 120 | 55      | 2,000
SDSC    | 65  | 750  | 475 | 1.6     | 3,300
CSTNET  | 274 | 3722 | 304 | 12      | 12,000
GridFTP TCP and UDT transfer times for 113 GB from gridftp.bio-mirror.net/biomirror/blast/ (Indiana USA).  All TCP and UDT times in minutes.  Source: http://gridftp.bio-mirror.net/biomirror/
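The transfer times in the table can be converted into effective throughput with a little arithmetic. The sketch below assumes 1 GB means 10^9 bytes (the slide does not say which convention was used); the function name is hypothetical, and the minute values are copied from the table.

```python
# (TCP minutes, UDT minutes) for moving 113 GB, copied from the table.
TRANSFER_MINUTES = {
    "NCSA": (139, 139),
    "Purdue": (125, 125),
    "ORNL": (361, 120),
    "TACC": (616, 120),
    "SDSC": (750, 475),
    "CSTNET": (3722, 304),
}


def throughput_mbps(gigabytes, minutes):
    """Effective throughput in Mb/s, assuming 1 GB = 10**9 bytes."""
    bits = gigabytes * 1e9 * 8
    return bits / (minutes * 60) / 1e6


for site, (tcp_min, udt_min) in TRANSFER_MINUTES.items():
    tcp = throughput_mbps(113, tcp_min)
    udt = throughput_mbps(113, udt_min)
    print(f"{site:7s} TCP {tcp:6.1f} Mb/s   UDT {udt:6.1f} Mb/s")
```

Read this way, the UDT transfers to most sites stay near or above 100 Mb/s as distance grows, while the TCP figures fall off sharply at the longer round trip times.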
  
Case Study: CGI 60 Genomes
•  Trace by Complete Genomics showing performance of moving 60 complete human genomes from Mountain View to Chicago using the open source Sector/UDT.
•  Approximately 18 TB at about 0.5 Mbs on 1G link.
Source: Complete Genomics.
  
Resource Use
Protocol      | CPU Usage*  | Memory*
GridFTP (UDT) | 1.0% - 3.0% | 40 Mb
GridFTP (TCP) | 0.1% - 0.6% | 6 Mb
*CPU and memory usage collected by Don Gilbert.  He reports that rsync uses more CPU than GridFTP with UDT.  Source: http://gridftp.bio-mirror.net/biomirror/.
  
Sector/Sphere
•  Sector/Sphere is a platform for data intensive computing built over UDT and designed to support geographically distributed clusters.
  
Questions?
For the most current version of these notes, see rgrossman.com
  

Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 

Recently uploaded (20)

Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 

Managing Big Data: An Introduction to Data Intensive Computing

  • 1. An Introduction to Data Intensive Computing. Chapter 2: Data Management. Robert Grossman (University of Chicago, Open Data Group) and Collin Bennett (Open Data Group). November 14, 2011.
  • 2. 1. Introduction (0830-0900): a. Data clouds (e.g. Hadoop); b. Utility clouds (e.g. Amazon). 2. Managing Big Data (0900-0945): a. Databases; b. Distributed File Systems (e.g. Hadoop); c. NoSQL databases (e.g. HBase). 3. Processing Big Data (0945-1000 and 1030-1100): a. Multiple Virtual Machines & Message Queues; b. MapReduce; c. Streams over distributed file systems. 4. Lab using Amazon's Elastic Map Reduce (1100-1200).
  • 3. What Are the Choices? Databases (SqlServer, Oracle, DB2); File Systems; Distributed File Systems (Hadoop, Sector); Clustered File Systems (glusterfs, …); NoSQL Databases (HBase, Accumulo, Cassandra, SimpleDB, …); Applications (R, SAS, Excel, etc.)
  • 4. What is the Fundamental Trade Off? Scale up vs scale out.
  • 6. Advice From Jim Gray
      1. Analyzing big data requires scale-out solutions, not scale-up solutions (GrayWulf).
      2. Move the analysis to the data.
      3. Work with scientists to find the most common "20 queries" and make them fast.
      4. Go from "working to working."
  • 7. Pattern 1: Put the metadata in a database and point to files in a file system.
  • 8. Example: Sloan Digital Sky Survey
      • Two surveys in one
        – Photometric survey in 5 bands
        – Spectroscopic redshift survey
      • Data is public
        – 40 TB of raw data
        – 5 TB processed catalogs
        – 2.5 Terapixels of images
      • Catalog uses Microsoft SQLServer
      • Started in 1992, finished in 2008
      • JHU SkyServer serves millions of queries
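Pattern 1 can be illustrated with a minimal sketch: a small database holds searchable metadata, and each row points to a file that stays in the file system. The table layout, column names and file paths below are hypothetical and are not taken from the SDSS catalog; SQLite stands in for the database only to keep the example self-contained.

```python
import sqlite3

# In-memory catalog; the schema and paths are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images ("
    "id INTEGER PRIMARY KEY, band TEXT, ra_deg REAL, dec_deg REAL, path TEXT)"
)
rows = [
    ("r", 180.1, 2.5, "/data/images/run1/frame-r-0001.fits"),
    ("g", 180.1, 2.5, "/data/images/run1/frame-g-0001.fits"),
    ("r", 10.7, -1.2, "/data/images/run2/frame-r-0042.fits"),
]
conn.executemany(
    "INSERT INTO images (band, ra_deg, dec_deg, path) VALUES (?, ?, ?, ?)", rows
)


def find_files(band, ra_min, ra_max):
    """Query the metadata only and return the paths of the matching files."""
    cur = conn.execute(
        "SELECT path FROM images "
        "WHERE band = ? AND ra_deg BETWEEN ? AND ? ORDER BY id",
        (band, ra_min, ra_max),
    )
    return [path for (path,) in cur]


# The large files are opened only after the database has narrowed the search.
paths = find_files("r", 170.0, 190.0)
print(paths)
```

The design choice is that queries touch only the small metadata table, while the bulk data is read directly from the file system once the relevant paths are known.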
  • 9. Example: Bionimbus Genomics Cloud (www.bionimbus.org)
  • 10. Architecture diagram: GWT-based Front End; Database Services; Analysis Pipelines & Re-analysis Services; Data Cloud Services; Data Ingestion Services; Utility Cloud Services; Intercloud Services.
  • 11. Architecture diagram: GWT-based Front End; Database Services (PostgreSQL); Analysis Pipelines & Re-analysis Services; Large Data Cloud Services (Hadoop, Sector/Sphere); Data Ingestion Services; Elastic Cloud Services (Eucalyptus, OpenStack); Intercloud Services (UDT, replication); ID Service.
  • 12. Section 2.2 Distributed File Systems. Sector/Sphere
  • 13. Hadoop's Large Data Cloud. Hadoop's Stack: Applications; Compute Services: Hadoop's MapReduce; Data Services: NoSQL Databases; Storage Services: Hadoop Distributed File System (HDFS).
  • 14. Pattern 2: Put the data into a distributed file system.
  • 15. Hadoop Design
      • Designed to run over commodity components that fail.
      • Data replicated, typically three times.
      • Block-based storage.
      • Optimized for efficient scans with high throughput, not low latency access.
      • Designed for write once, read many.
      • Append operation planned for future.
  • 16. Hadoop Distributed File System (HDFS) Architecture. Diagram: Client, Name Node (control), Data Nodes grouped in racks (data).
      • HDFS is block-based.
      • Written in Java.
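The block-based, replicated design described above can be sketched in a few lines. This is a toy model, not HDFS's placement policy: the 64 MB block size, the node names and the round-robin assignment are assumptions made for illustration, and the real system also takes rack location into account.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size in bytes
REPLICATION = 3                # "typically three times"


def place_blocks(file_size, data_nodes, block_size=BLOCK_SIZE,
                 replication=REPLICATION):
    """Return {block_index: [node, ...]} using simple round-robin placement."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    n_blocks = max(1, -(-file_size // block_size))  # ceiling division
    placement = {}
    for block in range(n_blocks):
        # Consecutive nodes hold the replicas, so no node stores a block twice.
        placement[block] = [
            data_nodes[(block + r) % len(data_nodes)] for r in range(replication)
        ]
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_blocks(200 * 1024 * 1024, nodes)
for block, replicas in placement.items():
    print(block, replicas)
```

Losing any single node in this model still leaves two copies of every block, which is the property that lets the file system run over commodity components that fail.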
  • 17. Sector Distributed File System (SDFS) Architecture
      • Broadly similar to Google File System and Hadoop Distributed File System.
      • Uses native file system. It is not block based.
      • Has security server that provides authorizations.
      • Has multiple master name servers so that there is no single point of failure.
      • Uses UDT to support wide area operations.
  • 18. Sector Distributed File System (SDFS) Architecture. Diagram: Client, Security Server and Master Nodes (control), Slave Nodes grouped in racks (data).
      • SDFS is file-based.
      • Written in C++.
      • Security server.
      • Multiple masters.
  • 19. GlusterFS Architecture
      • No metadata server.
      • No single point of failure.
      • Uses algorithms to determine location of data.
      • Can scale out by adding more bricks.
  • 20. GlusterFS Architecture. Diagram: Client, GlusterFS Server, Bricks grouped in racks (data). File-based.
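Locating data algorithmically, without a metadata server, can be sketched as a pure function from a file path to a brick. The hashing scheme below (MD5 modulo the number of bricks) is a deliberately simple assumption chosen for illustration; it is not GlusterFS's actual placement algorithm, and the brick names are hypothetical.

```python
import hashlib


def brick_for(path, bricks):
    """Pick a brick by hashing the file path.

    Every client computes the same answer from the path alone,
    so no central metadata lookup is needed.
    """
    digest = hashlib.md5(path.encode("utf-8")).digest()
    return bricks[int.from_bytes(digest[:8], "big") % len(bricks)]


bricks = ["brick1", "brick2", "brick3"]
for name in ["/vol/genome-001.dat", "/vol/genome-002.dat", "/vol/readme.txt"]:
    print(name, "->", brick_for(name, bricks))
```

A plain modulo scheme like this one remaps most files when a brick is added, which is why production systems use more elaborate hashing.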
  • 21. Section 2.3 NoSQL Databases
  • 22. Evolution
      • Standard architecture for simple web applications:
        – Presentation: front-end, load balanced web servers
        – Business logic layer
        – Backend database
      • Database layer does not scale with large numbers of users or large amounts of data
      • Alternatives arose
        – Sharded (partitioned) databases or master-slave dbs
        – memcache
  • 23. Scaling RDBMS
      • Master-slave database systems
        – Writes to master
        – Reads from slaves
        – Can be bottlenecks writing to slaves; can be inconsistent
      • Sharded databases
        – Applications and queries must understand sharding schema
        – Both reads and writes scale
        – No native, direct support for joins across shards
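The sharding trade-off can be made concrete with a small in-memory sketch. The class, the CRC32-based routing and the sample data are assumptions for illustration: single-key reads and writes go to exactly one shard, while any query that spans keys must be assembled by the application across all shards.

```python
import zlib


class ShardedStore:
    """Toy key-value store partitioned across several in-memory shards."""

    def __init__(self, n_shards):
        self.shards = [dict() for _ in range(n_shards)]

    def _shard(self, key):
        # The application must know this routing rule (the sharding schema).
        return self.shards[zlib.crc32(key.encode("utf-8")) % len(self.shards)]

    def put(self, key, value):
        self._shard(key)[key] = value

    def get(self, key):
        return self._shard(key).get(key)

    def scan(self, predicate):
        """Cross-shard query: visit every shard and merge the results."""
        return sorted(
            key
            for shard in self.shards
            for key, value in shard.items()
            if predicate(value)
        )


store = ShardedStore(n_shards=4)
for user, city in [("alice", "Chicago"), ("bob", "Austin"), ("carol", "Chicago")]:
    store.put(user, {"city": city})

chicago = store.scan(lambda row: row["city"] == "Chicago")
print(chicago)
```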
  • 24. NoSQL Systems
      • Suggests No SQL support, also Not Only SQL
      • One or more of the ACID properties not supported
      • Joins generally not supported
      • Usually flexible schemas
      • Some well known examples: Google's BigTable, Amazon's Dynamo & Facebook's Cassandra
      • Quite a few recent open source systems
  • 25. Pattern 3: Put the data into a NoSQL application.
  • 27. CAP – Choose Two Per Operation. C: Consistency; A: Availability; P: Partition-resiliency.
      • CA: available and consistent, unless there is a partition.
      • AP: a reachable replica provides service even in a partition, but may be inconsistent (Dynamo, Cassandra).
      • CP: always consistent, even in a partition, but a reachable replica may deny service without quorum (BigTable, HBase).
  • 28. CAP Theorem
      • Proposed by Eric Brewer, 2000
      • Three properties of a system: consistency, availability and partition tolerance
      • You can have at most two of these three properties for any shared-data system
      • Scale out requires partitions
      • Most large web-based systems choose availability over consistency
      Reference: Brewer, PODC 2000; Gilbert/Lynch, SIGACT News 2002
  • 29. Eventual Consistency
      • If no updates occur for a while, all updates eventually propagate through the system and all the nodes will be consistent
      • Eventually, a node is either updated or removed from service.
      • Can be implemented with Gossip protocol
      • Amazon's Dynamo popularized this approach
      • Sometimes this is called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID
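Eventual consistency through gossip can be sketched with a small simulation. The model below is an assumption made for illustration (eight replicas, one random peer per node per round, newest version wins); it is not Dynamo's actual protocol.

```python
import random


def gossip_round(states, rng):
    """Each node contacts one random peer; both keep the newer (version, value)."""
    n = len(states)
    for i in range(n):
        j = rng.choice([k for k in range(n) if k != i])
        newest = max(states[i], states[j])  # tuples compare by version first
        states[i] = states[j] = newest


def rounds_to_converge(states, rng, max_rounds=100):
    """Gossip until every node holds the same state; return the round count."""
    rounds = 0
    while len(set(states)) > 1:
        if rounds == max_rounds:
            raise RuntimeError("replicas did not converge")
        gossip_round(states, rng)
        rounds += 1
    return rounds


rng = random.Random(0)             # fixed seed so runs are repeatable
states = [(1, "old")] * 8          # eight replicas holding version 1
states[0] = (2, "new")             # one replica accepts an update
rounds = rounds_to_converge(states, rng)
print("converged after", rounds, "rounds:", states[0])
```

Between the update and convergence, a read served by a replica that has not yet heard the news returns the old value, which is the inconsistency window that BASE accepts.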
  • 30. Different Types of NoSQL Systems
      • Distributed Key-Value Systems
        – Amazon's S3 Key-Value Store (Dynamo)
        – Voldemort
        – Cassandra
      • Column-based Systems
        – BigTable
        – HBase
        – Cassandra
      • Document-based systems
        – CouchDB
  • 31. HBase Architecture. Diagram: Clients (Java client, REST API), HBaseMaster, and HRegionServers, each with its own disk. Source: Raghu Ramakrishnan
  • 32. HRegion Server
      • Records partitioned by column family into HStores
        – Each HStore contains many MapFiles
      • All writes to HStore applied to single memcache
      • Reads consult MapFiles and memcache
      • Memcaches flushed as MapFiles (HDFS files) when full
      • Compactions limit number of MapFiles
      Diagram: within an HRegionServer, writes go to the HStore's Memcache, which is flushed to disk as MapFiles; reads consult both. Source: Raghu Ramakrishnan
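The write and read path listed above can be sketched as a toy store. The class name, thresholds and the use of plain dictionaries for MapFiles are assumptions for illustration; the real HRegionServer persists MapFiles to HDFS and is considerably more involved.

```python
class ToyHStore:
    """Writes go to an in-memory cache; full caches are flushed to
    immutable 'MapFiles'; compaction merges them to bound their number."""

    def __init__(self, flush_threshold=2, max_files=2):
        self.flush_threshold = flush_threshold
        self.max_files = max_files
        self.memcache = {}
        self.map_files = []  # oldest first; each is a key-sorted dict

    def put(self, key, value):
        self.memcache[key] = value
        if len(self.memcache) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.map_files.append(dict(sorted(self.memcache.items())))
        self.memcache = {}
        if len(self.map_files) > self.max_files:
            self.compact()

    def compact(self):
        merged = {}
        for map_file in self.map_files:  # oldest first, so newer values win
            merged.update(map_file)
        self.map_files = [dict(sorted(merged.items()))]

    def get(self, key):
        if key in self.memcache:              # newest data first
            return self.memcache[key]
        for map_file in reversed(self.map_files):
            if key in map_file:
                return map_file[key]
        return None


store = ToyHStore()
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5), ("e", 6)]:
    store.put(key, value)
print(store.get("a"), len(store.map_files))
```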
  • 33. Facebook's Cassandra
      • Modeled after BigTable's data model
      • Modeled after Dynamo's eventual consistency
      • Peer to peer storage architecture using consistent hashing (Chord hashing)
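Consistent hashing can be sketched as a ring of hash points, with each key owned by the first node point at or after the key's hash. The MD5 hash, the 64 virtual points per node and the node names are assumptions for illustration and do not describe Cassandra's implementation.

```python
import bisect
import hashlib


def ring_hash(text):
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, vnodes=64):
        # Several points per node smooth out the load (an assumed refinement).
        self._points = sorted(
            (ring_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def node_for(self, key):
        # First point clockwise from the key's hash, wrapping around the ring.
        idx = bisect.bisect_right(self._hashes, ring_hash(key)) % len(self._hashes)
        return self._points[idx][1]


keys = [f"row{i}" for i in range(1000)]
before = Ring(["a", "b", "c"])
after = Ring(["a", "b", "c", "d"])
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
print(f"{len(moved)} of {len(keys)} keys moved when node d joined")
```

The point of the technique is visible in the last lines: when a node joins, only the keys that the new node takes over change owner, and every other key stays where it was.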
  • 34. Databases vs NoSQL Systems
      Property | Databases | NoSQL Systems
      Scalability | 100's TB | 100's PB
      Functionality | Full SQL-based queries, including joins | Optimized access to sorted tables (tables with single keys)
      Optimized | Databases optimized for safe writes | Clouds optimized for efficient reads
      Consistency model | ACID (Atomicity, Consistency, Isolation & Durability): database always consistent | Eventual consistency: updates eventually propagate through system
      Parallelism | Difficult because of ACID model; shared nothing is possible | Basic design incorporates parallelism over commodity components
      Scale | Racks | Data center
  • 35. Section 2.3 Case Study: Project Matsu
  • 36. Zoom Levels / Bounds
      • Zoom Level 1: 4 images
      • Zoom Level 2: 16 images
      • Zoom Level 3: 64 images
      • Zoom Level 4: 256 images
      Source: Andrew Levine
  • 37. Build Tile Cache in the Cloud - Mapper
      • Step 1: Input to Mapper. Mapper Input Key: Bounding Box (minx = -135.0 miny = 45.0 maxx = -112.5 maxy = 67.5). Mapper Input Value: (image).
      • Step 2: Processing in Mapper. Mapper resizes and/or cuts up the original image into pieces to output Bounding Boxes.
      • Step 3: Mapper Output. For each piece, Mapper Output Key: Bounding Box; Mapper Output Value: (image piece).
      Source: Andrew Levine
  • 38. Build Tile Cache in the Cloud - Reducer
      • Step 1: Input to Reducer. Reducer Key Input: Bounding Box (minx = -45.0 miny = -2.8125 maxx = -43.59375 maxy = -2.109375). Reducer Value Input: (image pieces) …
      • Step 2: Reducer Output. Assemble images based on bounding box.
        – Output to HBase
        – Builds up Layers for WMS for various datasets
      Source: Andrew Levine
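The mapper and reducer steps can be sketched in plain Python, without Hadoop. Everything here is a simplification under stated assumptions: each level splits a bounding box into four quadrants (consistent with 4, 16, 64, 256 images per zoom level), image data is replaced by a label, and the scene names are hypothetical.

```python
from collections import defaultdict


def images_at_zoom(level):
    """Each zoom level quadruples the number of tiles."""
    return 4 ** level


def split_bbox(bbox):
    """Split (minx, miny, maxx, maxy) into its four quadrants."""
    minx, miny, maxx, maxy = bbox
    midx, midy = (minx + maxx) / 2, (miny + maxy) / 2
    return [
        (minx, miny, midx, midy), (midx, miny, maxx, midy),
        (minx, midy, midx, maxy), (midx, midy, maxx, maxy),
    ]


def mapper(bbox, image, levels):
    """Cut the input image into tiles and emit (tile bounding box, piece)."""
    tiles = [bbox]
    for _ in range(levels):
        tiles = [quad for tile in tiles for quad in split_bbox(tile)]
    for tile in tiles:
        yield tile, (image, tile)  # stand-in for the cropped pixels


def reducer(tile_bbox, pieces):
    """Assemble all pieces that share a bounding box into one tile record."""
    return tile_bbox, sorted(pieces)


source_bbox = (-135.0, 45.0, -112.5, 67.5)
grouped = defaultdict(list)            # the shuffle: group values by key
for image in ["sceneA", "sceneB"]:
    for tile, piece in mapper(source_bbox, image, levels=1):
        grouped[tile].append(piece)

tile_cache = dict(reducer(tile, pieces) for tile, pieces in grouped.items())
print(len(tile_cache), "tiles;", images_at_zoom(4), "images at zoom level 4")
```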
  • 39. HBase Tables
      • Open Geospatial Consortium (OGC) Web Map Service (WMS) Query translates to HBase schema
        – Layers, Styles, Projection, Size
      • Table name: WMS Layer
        – Row ID: Bounding Box of image
        – Column Family: Style Name and Projection
        – Column Qualifier: Width x Height
        – Value: Buffered Image
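The table layout above can be mimicked with nested dictionaries to show how a WMS request resolves to a single cell. This sketch does not use an HBase client; the string encodings of the row ID, column family and qualifier, and the sample layer values, are assumptions for illustration.

```python
def row_id(bbox):
    """Encode a bounding box (minx, miny, maxx, maxy) as a row key string."""
    return ",".join(str(v) for v in bbox)


def put_tile(table, bbox, style, projection, width, height, image_bytes):
    """Store one tile: row = bounding box, family = style and projection,
    qualifier = width x height, value = image bytes."""
    family = table.setdefault(row_id(bbox), {}).setdefault(f"{style}:{projection}", {})
    family[f"{width}x{height}"] = image_bytes


def get_tile(table, bbox, style, projection, width, height):
    """Resolve a WMS-style request to one cell, or None if it is absent."""
    return (
        table.get(row_id(bbox), {})
        .get(f"{style}:{projection}", {})
        .get(f"{width}x{height}")
    )


layer_table = {}  # one table per WMS layer
bbox = (-45.0, -2.8125, -43.59375, -2.109375)
put_tile(layer_table, bbox, "default", "EPSG:4326", 256, 256, b"\x89PNG...")
tile = get_tile(layer_table, bbox, "default", "EPSG:4326", 256, 256)
print(tile is not None)
```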
  • 40. Section 2.4 Distributed Key-Value Stores. S3
  • 41. Pattern 4: Put the data into a distributed key-value store.
  • 42. S3 Buckets
      • S3 bucket names must be unique across AWS.
      • A good practice is to use a pattern like tutorial.osdc.org/dataset1.txt for a domain you own.
      • The file is then referenced as tutorial.osdc.org.s3.amazonaws.com/dataset1.txt
      • If you own osdc.org you can create a DNS CNAME entry to access the file as tutorial.osdc.org/dataset1.txt
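The two ways of referencing the same object can be sketched as simple string construction. The function names are hypothetical and the http scheme is an assumption, since the slide gives host names and paths only; no AWS library is involved.

```python
from urllib.parse import quote


def s3_url(bucket, key):
    """Address the object through the bucket's amazonaws.com host name."""
    return f"http://{bucket}.s3.amazonaws.com/{quote(key)}"


def cname_url(bucket, key):
    """Address the same object through a DNS CNAME equal to the bucket name."""
    return f"http://{bucket}/{quote(key)}"


print(s3_url("tutorial.osdc.org", "dataset1.txt"))
print(cname_url("tutorial.osdc.org", "dataset1.txt"))
```

Naming the bucket after a host in a domain you own is what makes the second form possible, because the CNAME and the bucket name are then the same string.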
  • 43. S3 Keys
      • Keys must be unique within a bucket.
      • Values can be as large as 5 TB (formerly 5 GB)
  • 44. S3 Security
      • AWS access key (user name): this functions as your S3 username. It is an alphanumeric text string that uniquely identifies users.
      • AWS secret key (functions as password)
  • 46. Access Keys: User Name, Password
  • 47. Other Amazon Data Services
      • Amazon SimpleDB
      • Amazon's Elastic Block Store (EBS)
  • 48. Section 2.5 Moving Large Data Sets
  • 49. The Basic Problem
      • TCP was never designed to move large data sets over wide area high performance networks.
      • As a general rule, reading data off disks is slower than transporting it over the network.
  • 50. TCP Throughput vs RTT and Packet Loss. Figure: throughput (Mb/s, 200 to 1000) against round trip time (ms; 1, 10, 100, 200, 400, spanning LAN, US, US-EU and US-ASIA paths) for packet loss rates from 0.01% to 0.5%. Source: Yunhong Gu, 2007, experiments over wide area 1G.
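The shape of the throughput figure can be approximated with the well-known Mathis et al. model of steady-state TCP throughput, roughly (MSS / RTT) × sqrt(3/2) / sqrt(loss). This model is not taken from the slides; it is brought in here as an assumption to show why throughput falls as round trip time and loss grow. The 1460-byte segment size and the 1000 Mb/s cap are also assumed.

```python
import math


def tcp_throughput_mbps(rtt_ms, loss, mss_bytes=1460, link_mbps=1000.0):
    """Approximate single-stream TCP throughput in Mb/s, capped at the link rate."""
    rtt_s = rtt_ms / 1000.0
    bits_per_s = (mss_bytes * 8 / rtt_s) * math.sqrt(1.5) / math.sqrt(loss)
    return min(link_mbps, bits_per_s / 1e6)


for rtt_ms in (1, 10, 100, 200, 400):
    row = [tcp_throughput_mbps(rtt_ms, loss) for loss in (0.0001, 0.001, 0.005)]
    print(rtt_ms, [round(v, 1) for v in row])
```

Under this model a single stream on a 100 ms path with 0.01% loss reaches only a small fraction of a 1G link, which is the motivation for parallel streams and UDT-style protocols.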
  • 51. The Solution
      • Use parallel TCP streams
        – GridFTP
      • Use specialized network protocols
        – UDT, FAST, etc.
      • Use RAID to stripe data across disks to improve throughput when reading
      • These techniques are well understood in HEP, astronomy, but not yet in biology.
  • 52. Case Study: Bio-mirror. "[The open source GridFTP] from the Globus project has recently been improved to offer UDP-based file transport, with long-distance speed improvements of 3x to 10x over the usual TCP-based file transport." Don Gilbert, August 2010, bio-mirror.net
  • 53. Moving 113 GB of Bio-mirror Data
      Site | RTT | TCP | UDT | TCP/UDT | Km
      NCSA | 10 | 139 | 139 | 1 | 200
      Purdue | 17 | 125 | 125 | 1 | 500
      ORNL | 25 | 361 | 120 | 3 | 1,200
      TACC | 37 | 616 | 120 | 55 | 2,000
      SDSC | 65 | 750 | 475 | 1.6 | 3,300
      CSTNET | 274 | 3722 | 304 | 12 | 12,000
      GridFTP TCP and UDT transfer times for 113 GB from gridftp.bio-mirror.net/biomirror/blast/ (Indiana USA). All TCP and UDT times in minutes. Source: http://gridftp.bio-mirror.net/biomirror/
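The TCP/UDT column can be recomputed from the two time columns, and the times can be converted into effective throughput. The sketch below uses only the TCP and UDT minutes printed in the table; treating 113 GB as 113 × 10^9 bytes is an assumption.

```python
# site: (TCP minutes, UDT minutes), copied from the table above
transfers = {
    "NCSA": (139, 139),
    "Purdue": (125, 125),
    "ORNL": (361, 120),
    "TACC": (616, 120),
    "SDSC": (750, 475),
    "CSTNET": (3722, 304),
}

DATA_BITS = 113e9 * 8  # assumes decimal gigabytes


def throughput_mbps(minutes):
    """Effective rate in Mb/s for moving the whole data set in the given time."""
    return DATA_BITS / (minutes * 60) / 1e6


speedup = {site: tcp / udt for site, (tcp, udt) in transfers.items()}
for site, (tcp, udt) in transfers.items():
    print(f"{site:7s} TCP/UDT = {speedup[site]:5.1f}   "
          f"UDT rate = {throughput_mbps(udt):6.1f} Mb/s")
```

Recomputing the ratios this way makes it easy to compare them with the printed TCP/UDT column row by row.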
  • 54. Case Study: CGI 60 Genomes
      • Trace by Complete Genomics showing performance of moving 60 complete human genomes from Mountain View to Chicago using the open source Sector/UDT.
      • Approximately 18 TB at about 0.5 Gb/s on a 1G link.
      Source: Complete Genomics.
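A quick calculation shows how long a transfer of this size takes. It assumes the sustained rate was about 0.5 Gb/s (half of the 1G link) and that 18 TB means 18 × 10^12 bytes; both are assumptions about how to read the slide's figures.

```python
def transfer_days(size_tb, rate_gbps):
    """Days needed to move size_tb terabytes at a sustained rate in Gb/s."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (rate_gbps * 1e9)
    return seconds / 86400


days = transfer_days(18, 0.5)
print(f"{days:.1f} days at 0.5 Gb/s, {transfer_days(18, 1.0):.1f} days at full 1 Gb/s")
```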
  • 55. Resource Use
      Protocol | CPU Usage* | Memory*
      GridFTP (UDT) | 1.0% - 3.0% | 40 Mb
      GridFTP (TCP) | 0.1% - 0.6% | 6 Mb
      *CPU and memory usage collected by Don Gilbert. He reports that rsync uses more CPU than GridFTP with UDT. Source: http://gridftp.bio-mirror.net/biomirror/.
  • 56. Sector/Sphere
      • Sector/Sphere is a platform for data intensive computing built over UDT and designed to support geographically distributed clusters.
  • 57. Questions? For the most current version of these notes, see rgrossman.com