An Introduction to Data Intensive Computing

Chapter 2: Data Management

Robert Grossman
University of Chicago
Open Data Group

Collin Bennett
Open Data Group

November 14, 2011
  
1. Introduction (0830-0900)
   a. Data clouds (e.g. Hadoop)
   b. Utility clouds (e.g. Amazon)
2. Managing Big Data (0900-0945)
   a. Databases
   b. Distributed File Systems (e.g. Hadoop)
   c. NoSQL databases (e.g. HBase)
3. Processing Big Data (0945-1000 and 1030-1100)
   a. Multiple Virtual Machines & Message Queues
   b. MapReduce
   c. Streams over distributed file systems
4. Lab using Amazon's Elastic MapReduce (1100-1200)
  
	
  
What Are the Choices?

- Applications (R, SAS, Excel, etc.)
- Databases (SQL Server, Oracle, DB2)
- NoSQL Databases (HBase, Accumulo, Cassandra, SimpleDB, ...)
- File Systems
- Clustered File Systems (GlusterFS, ...)
- Distributed File Systems (Hadoop, Sector)
  
What is the Fundamental Trade-Off?

Scale up vs. scale out ...
  
Section 2.1
Databases
  
Advice From Jim Gray

1. Analyzing big data requires scale-out solutions, not scale-up solutions (GrayWulf).
2. Move the analysis to the data.
3. Work with scientists to find the most common "20 queries" and make them fast.
4. Go from "working to working."
  
Pattern 1: Put the metadata in a database and point to files in a file system.
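Below is a minimal sketch of this pattern, assuming a SQLite catalog; the table, column names, and file path are illustrative and not from the tutorial. The small, queryable metadata lives in the database, while each row only points at a bulky file kept on a (possibly distributed) file system.

```python
# Minimal sketch of Pattern 1: keep queryable metadata in a database
# and leave the bulky files on a file system.
# Table, column, and path names here are illustrative.
import sqlite3

conn = sqlite3.connect("catalog.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS images (
        image_id   INTEGER PRIMARY KEY,
        ra         REAL,   -- right ascension of the field
        dec        REAL,   -- declination of the field
        band       TEXT,   -- photometric band (u, g, r, i, z)
        file_path  TEXT    -- pointer to the raw file, not the pixels
    )
""")
conn.execute(
    "INSERT INTO images (ra, dec, band, file_path) VALUES (?, ?, ?, ?)",
    (180.0, 45.0, "r", "/data/survey/run1234/frame-r-001234.fits"),
)
conn.commit()

# Queries run against the small metadata table; only matching files are opened.
for (path,) in conn.execute(
    "SELECT file_path FROM images WHERE band = 'r' AND ra BETWEEN 179 AND 181"
):
    print(path)
```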
  	
  
Example: Sloan Digital Sky Survey

- Two surveys in one
  - Photometric survey in 5 bands
  - Spectroscopic redshift survey
- Data is public
  - 40 TB of raw data
  - 5 TB of processed catalogs
  - 2.5 Terapixels of images
- Catalog uses Microsoft SQL Server
- Started in 1992, finished in 2008
- JHU SkyServer serves millions of queries
  	
  
Example: Bionimbus Genomics Cloud

www.bionimbus.org
  
Bionimbus architecture (two diagrams), with its main components:

- GWT-based front end
- Database services (PostgreSQL)
- ID service
- Data ingestion services
- Analysis pipeline & re-analysis services
- Utility / elastic cloud services (Eucalyptus, OpenStack)
- Large data cloud services (Hadoop, Sector/Sphere)
- Intercloud services (UDT, replication)
  
Section 2.2
Distributed File Systems

Sector/Sphere
  
Hadoop's Large Data Cloud (Hadoop's Stack)

- Applications
- Compute Services: Hadoop's MapReduce
- Data Services: NoSQL Databases
- Storage Services: Hadoop Distributed File System (HDFS)
Pattern 2: Put the data into a distributed file system.
  
Hadoop Design

- Designed to run over commodity components that fail.
- Data is replicated, typically three times.
- Block-based storage.
- Optimized for efficient scans with high throughput, not low-latency access.
- Designed for write once, read many.
- Append operation planned for the future.
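A minimal sketch of the write-once, read-many workflow, assuming a running HDFS cluster with the `hadoop` command-line client on the PATH; the /user/demo paths are illustrative.

```python
# Minimal sketch: load a file into HDFS and read it back through the
# standard command-line client. Assumes `hadoop` is on PATH and HDFS is up;
# the /user/demo paths are illustrative.
import subprocess

def hdfs(*args):
    """Run a `hadoop fs` subcommand and return its stdout."""
    out = subprocess.run(["hadoop", "fs", *args],
                         check=True, capture_output=True, text=True)
    return out.stdout

hdfs("-mkdir", "-p", "/user/demo")
hdfs("-put", "-f", "local_data.txt", "/user/demo/data.txt")   # write once
print(hdfs("-ls", "/user/demo"))   # the listing shows the replication factor
print(hdfs("-cat", "/user/demo/data.txt"))                    # read many
```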
  
Hadoop Distributed File System (HDFS) Architecture

- HDFS is block-based.
- Written in Java.

Diagram: a Client exchanges control messages with the Name Node and data with the Data Nodes, which are spread across racks.
  
Sector Distributed File System (SDFS) Architecture

- Broadly similar to the Google File System and the Hadoop Distributed File System.
- Uses the native file system; it is not block-based.
- Has a security server that provides authorizations.
- Has multiple master name servers, so there is no single point of failure.
- Uses UDT to support wide-area operations.
  
Sector Distributed File System (SDFS) Architecture

- SDFS is file-based.
- Written in C++.
- Security server.
- Multiple masters.

Diagram: a Client exchanges control messages with the Master Nodes and the Security Server, and data with the Slave Nodes, which are spread across racks.
  
GlusterFS Architecture

- No metadata server.
- No single point of failure.
- Uses algorithms to determine the location of data.
- Can scale out by adding more bricks.
  
GlusterFS Architecture

- File-based.

Diagram: a Client exchanges data with a GlusterFS Server, which stores data on bricks spread across racks.
  
Section 2.3
NoSQL Databases
  
Evolution

- Standard architecture for simple web applications:
  - Presentation: front-end, load-balanced web servers
  - Business logic layer
  - Backend database
- The database layer does not scale with large numbers of users or large amounts of data.
- Alternatives arose:
  - Sharded (partitioned) databases or master-slave databases
  - memcache
  
Scaling RDBMS

- Master-slave database systems
  - Writes go to the master.
  - Reads go to the slaves.
  - Writing to the slaves can be a bottleneck, and reads can be inconsistent.
- Sharded databases
  - Applications and queries must understand the sharding schema (see the sketch below).
  - Both reads and writes scale.
  - No native, direct support for joins across shards.
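An illustrative sketch of application-level sharding, assuming hash-based routing; the shard names and the user_id key are hypothetical and not from the tutorial.

```python
# Illustrative sketch of application-level sharding: the application must
# know the sharding schema to route each read or write to the right database.
# Shard names and the user_id key are hypothetical.
import hashlib

SHARDS = ["users_db_0", "users_db_1", "users_db_2", "users_db_3"]

def shard_for(user_id: str) -> str:
    """Pick a shard by hashing the key; all queries for this key go there."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("alice"))  # reads and writes for "alice" always hit one shard
print(shard_for("bob"))

# A join across users on different shards has no native support: the
# application would have to query each shard and merge the results itself.
```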
  
NoSQL Systems

- The name suggests "no SQL support," but is also read as "Not Only SQL."
- One or more of the ACID properties is not supported.
- Joins are generally not supported.
- Usually flexible schemas.
- Some well-known examples: Google's BigTable, Amazon's Dynamo, and Facebook's Cassandra.
- Quite a few recent open source systems.
  
Pattern 3: Put the data into a NoSQL application.
  
  
CAP - Choose Two Per Operation

Consistency (C), Availability (A), Partition-resiliency (P):

- CA: available and consistent, unless there is a partition.
- CP: always consistent, even in a partition, but a reachable replica may deny service without quorum. (BigTable, HBase)
- AP: a reachable replica provides service even in a partition, but may be inconsistent. (Dynamo, Cassandra)
  
CAP Theorem

- Proposed by Eric Brewer, 2000.
- Three properties of a system: consistency, availability, and partitions.
- You can have at most two of these three properties for any shared-data system.
- Scale-out requires partitions.
- Most large web-based systems choose availability over consistency.

Reference: Brewer, PODC 2000; Gilbert/Lynch, SIGACT News 2002
  
Eventual Consistency

- If no updates occur for a while, all updates eventually propagate through the system and all the nodes will be consistent.
- Eventually, a node is either updated or removed from service.
- Can be implemented with a gossip protocol (see the sketch below).
- Amazon's Dynamo popularized this approach.
- Sometimes this is called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID.
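A toy sketch of gossip-style anti-entropy to illustrate eventual consistency; it is not Dynamo's actual protocol, and the last-writer-wins merge rule is an assumption made for brevity.

```python
# Toy sketch of gossip-style anti-entropy: each round, two random replicas
# exchange their latest (version, value); once updates stop, all replicas
# converge. Illustrates eventual consistency, not Dynamo's real protocol.
import random

class Node:
    def __init__(self):
        self.version, self.value = 0, None

    def merge(self, other):
        # Last-writer-wins: keep whichever replica has the newer version.
        if other.version > self.version:
            self.version, self.value = other.version, other.value

nodes = [Node() for _ in range(5)]
nodes[0].version, nodes[0].value = 1, "v1"   # one replica accepts a write

rounds = 0
while any(n.value != "v1" for n in nodes):
    a, b = random.sample(nodes, 2)           # gossip with a random peer
    a.merge(b)
    b.merge(a)
    rounds += 1

print(f"all {len(nodes)} replicas consistent after {rounds} gossip exchanges")
```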
  
Different Types of NoSQL Systems

- Distributed key-value systems
  - Amazon's S3 Key-Value Store (Dynamo)
  - Voldemort
  - Cassandra
- Column-based systems
  - BigTable
  - HBase
  - Cassandra
- Document-based systems
  - CouchDB
  
HBase Architecture

Diagram: clients (including a Java client and a REST API) connect to the HBaseMaster, which coordinates a set of HRegionServers, each backed by disk.

Source: Raghu Ramakrishnan
  
HRegion Server

- Records are partitioned by column family into HStores.
  - Each HStore contains many MapFiles.
- All writes to an HStore are applied to a single memcache.
- Reads consult the MapFiles and the memcache.
- Memcaches are flushed as MapFiles (HDFS files) when full.
- Compactions limit the number of MapFiles.

Diagram: within an HRegionServer, writes go to the memcache, which is flushed to disk as MapFiles in an HStore; reads consult the MapFiles.

Source: Raghu Ramakrishnan
  
Facebook's Cassandra

- Modeled after BigTable's data model.
- Modeled after Dynamo's eventual consistency.
- Peer-to-peer storage architecture using consistent hashing (Chord hashing).
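A minimal sketch of consistent hashing on a ring, in the spirit of Chord/Dynamo; the hash function choice and node names are illustrative, not Cassandra's implementation.

```python
# Minimal sketch of consistent hashing on a ring: keys and nodes are hashed
# onto the same circle, and a key is stored on the first node clockwise from
# its position. Node names and the SHA-1 choice are illustrative.
import bisect
import hashlib

def ring_hash(s: str) -> int:
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        h = ring_hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
for key in ("row:1", "row:2", "row:3"):
    print(key, "->", ring.node_for(key))

# Adding or removing a node only remaps the keys between it and its
# predecessor on the ring, which is why peer-to-peer stores scale this way.
```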
  
                   Databases                                NoSQL Systems
Scalability        100's of TB                              100's of PB
Functionality      Full SQL-based queries,                  Optimized access to sorted tables
                   including joins                          (tables with single keys)
Optimized          Databases optimized for safe writes      Clouds optimized for efficient reads
Consistency model  ACID (Atomicity, Consistency,            Eventual consistency - updates
                   Isolation & Durability) - the            eventually propagate through the
                   database is always consistent            system
Parallelism        Difficult because of the ACID model;     Basic design incorporates parallelism
                   shared nothing is possible               over commodity components
Scale              Racks                                    Data center
  
Section 2.3
Case Study: Project Matsu
  
Zoom Levels / Bounds

- Zoom Level 1: 4 images
- Zoom Level 2: 16 images
- Zoom Level 3: 64 images
- Zoom Level 4: 256 images

Source: Andrew Levine
  
Build Tile Cache in the Cloud - Mapper

Step 1: Input to Mapper
  Mapper Input Key: Bounding Box (minx = -135.0, miny = 45.0, maxx = -112.5, maxy = 67.5)
  Mapper Input Value: the original image

Step 2: Processing in Mapper
  The mapper resizes and/or cuts up the original image into pieces to output bounding boxes.

Step 3: Mapper Output
  Many (Mapper Output Key: Bounding Box, Mapper Output Value: image tile) pairs, one per tile.

Source: Andrew Levine
  
Build Tile Cache in the Cloud - Reducer

Step 1: Input to Reducer
  Reducer Key Input: Bounding Box (minx = -45.0, miny = -2.8125, maxx = -43.59375, maxy = -2.109375)
  Reducer Value Input: the image tiles for that bounding box

Step 2: Reducer Output
  Assemble images based on the bounding box.
  - Output to HBase
  - Builds up layers for WMS for various datasets

Source: Andrew Levine
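A schematic skeleton of the mapper/reducer flow described above, written in the Hadoop Streaming style; tile payloads are stood in by file paths, and all names are illustrative rather than Project Matsu's actual code.

```python
# Schematic skeleton of the tile-cache MapReduce pattern above, in the Hadoop
# Streaming style (tab-separated key/value lines on stdin/stdout). Tile
# payloads are stood in by file paths; this illustrates the key/value flow
# only, not Project Matsu's actual code.
import sys
from itertools import groupby

def cut_into_tiles(bbox, image_path):
    # Placeholder for the real resize/cut step; yields (tile_bbox, tile) pairs.
    yield bbox, image_path

def mapper(lines):
    # Input: "<bounding box>\t<path to source image>"
    # Output: one line per tile, keyed by the tile's bounding box.
    for line in lines:
        bbox, image_path = line.rstrip("\n").split("\t")
        for tile_bbox, tile_path in cut_into_tiles(bbox, image_path):
            print(f"{tile_bbox}\t{tile_path}")

def reducer(lines):
    # Hadoop delivers lines sorted by key, so all tiles for one bounding box
    # arrive together; assemble them and write the result out (e.g. to HBase).
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for bbox, group in groupby(parsed, key=lambda kv: kv[0]):
        tiles = [path for _, path in group]
        print(f"{bbox}\tassembled {len(tiles)} tiles")

if __name__ == "__main__":
    if sys.argv[1:] == ["reduce"]:
        reducer(sys.stdin)
    else:
        mapper(sys.stdin)
```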
  
HBase Tables

- An Open Geospatial Consortium (OGC) Web Mapping Service (WMS) query translates to the HBase schema:
  - Layers, Styles, Projection, Size
- Table name: WMS Layer
  - Row ID: Bounding Box of the image
  - Column Family: Style Name and Projection
  - Column Qualifier: Width x Height
  - Value: Buffered Image
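A hedged sketch of this schema using the happybase HBase client. It assumes a running HBase Thrift server on localhost; the table name, column-family name, and tile file are illustrative and follow the mapping above rather than the project's actual configuration.

```python
# Hedged sketch of the WMS-layer schema above using the happybase client
# (assumes a running HBase Thrift server on localhost; names are illustrative).
import happybase

connection = happybase.Connection("localhost")

# One table per WMS layer; one column family per style + projection.
# HBase column-family names cannot contain ':', so the projection is written
# here as EPSG4326.
table_name = "wms_bluemarble"
if table_name.encode() not in connection.tables():
    connection.create_table(table_name, {"default_EPSG4326": dict()})

table = connection.table(table_name)

row_id = b"-135.0,45.0,-112.5,67.5"      # row key: bounding box of the tile
column = b"default_EPSG4326:256x256"     # family:qualifier = style+projection : WxH
with open("tile.png", "rb") as f:
    table.put(row_id, {column: f.read()})   # value: the buffered image bytes

# A WMS GetMap request then maps directly onto a single-row read:
tile_bytes = table.row(row_id)[column]
print(len(tile_bytes), "bytes served for bounding box", row_id.decode())
```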
  
Section 2.4
Distributed Key-Value Stores

S3
  
Pattern 4: Put the data into a distributed key-value store.
  
S3 Buckets

- S3 bucket names must be unique across AWS.
- A good practice is to use a pattern like
      tutorial.osdc.org/dataset1.txt
  for a domain you own.
- The file is then referenced as
      tutorial.osdc.org.s3.amazonaws.com/dataset1.txt
- If you own osdc.org, you can create a DNS CNAME entry to access the file as
      tutorial.osdc.org/dataset1.txt
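A minimal sketch of this naming pattern using the boto3 client (not part of the original tutorial). It assumes AWS credentials are already configured, that the domain-style bucket name is available in your account, and that the default region is us-east-1 (other regions need a CreateBucketConfiguration).

```python
# Minimal sketch of the domain-style bucket pattern above using boto3.
# Assumes configured AWS credentials, an available bucket name, and the
# us-east-1 region (other regions need a CreateBucketConfiguration).
import boto3

s3 = boto3.client("s3")
bucket = "tutorial.osdc.org"            # bucket named after a domain you own

s3.create_bucket(Bucket=bucket)
s3.upload_file("dataset1.txt", bucket, "dataset1.txt")

# The object is then reachable at
#   tutorial.osdc.org.s3.amazonaws.com/dataset1.txt
# and, with a CNAME from tutorial.osdc.org to that endpoint, at
#   tutorial.osdc.org/dataset1.txt
head = s3.head_object(Bucket=bucket, Key="dataset1.txt")
print(head["ContentLength"], "bytes stored")
```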
  
S3 Keys

- Keys must be unique within a bucket.
- Values can be as large as 5 TB (formerly 5 GB).
  
S3 Security

- AWS access key (user name)
  - This functions as your S3 username. It is an alphanumeric text string that uniquely identifies users.
- AWS secret key (functions as a password)
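A hedged sketch of where the key pair plugs in when using boto3; the key strings below are placeholders, and in practice boto3 would normally read them from the environment or ~/.aws/credentials rather than from source code.

```python
# Hedged sketch: the access key / secret key pair maps onto an explicit boto3
# session. The key strings are placeholders; prefer environment variables or
# ~/.aws/credentials over hard-coding them.
import boto3

session = boto3.session.Session(
    aws_access_key_id="AKIA...",       # AWS access key: functions as the username
    aws_secret_access_key="...",       # AWS secret key: functions as the password
)
s3 = session.client("s3")
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```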
  
AWS Account Information

Screenshot: the AWS account page listing Access Keys, which play the role of a user name and password.
  
Other Amazon Data Services

- Amazon SimpleDB (SDB)
- Amazon Elastic Block Store (EBS)
  
Section 2.5
Moving Large Data Sets
  
The Basic Problem

- TCP was never designed to move large data sets over wide-area, high-performance networks.
- As a general rule, reading data off disks is slower than transporting it over the network.
  network.	
  	
  	
  
Figure: TCP Throughput vs RTT and Packet Loss. Throughput (Mb/s, 0-1000) falls sharply as round-trip time (1-400 ms) and packet loss (0.01% and above) increase; markers indicate LAN, US, US-EU, and US-ASIA distances.

Source: Yunhong Gu, 2007, experiments over wide area 1G.
  
The Solution

- Use parallel TCP streams (see the sketch below)
  - GridFTP
- Use specialized network protocols
  - UDT, FAST, etc.
- Use RAID to stripe data across disks to improve throughput when reading.
- These techniques are well understood in HEP and astronomy, but not yet in biology.
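A conceptual sketch of the "parallel TCP streams" idea, which GridFTP provides natively: split a large object into byte ranges and fetch the ranges over several TCP connections at once. The URL is a placeholder and the server is assumed to report Content-Length and honor HTTP Range requests.

```python
# Conceptual sketch of parallel TCP streams: fetch byte ranges of one large
# object over several connections at once. URL is a placeholder; the server
# must report Content-Length and support HTTP Range requests.
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://example.org/large-dataset.bin"   # placeholder
STREAMS = 4

def fetch_range(byte_range):
    start, end = byte_range
    r = requests.get(URL, headers={"Range": f"bytes={start}-{end}"}, timeout=60)
    return start, r.content

size = int(requests.head(URL, timeout=60).headers["Content-Length"])
step = size // STREAMS + 1
ranges = [(i, min(i + step - 1, size - 1)) for i in range(0, size, step)]

with ThreadPoolExecutor(max_workers=STREAMS) as pool:
    parts = dict(pool.map(fetch_range, ranges))       # one TCP stream per range

with open("large-dataset.bin", "wb") as out:
    for start in sorted(parts):                        # reassemble in order
        out.write(parts[start])
```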
  
Case Study: Bio-mirror

"[The open source GridFTP] from the Globus project has recently been improved to offer UDP-based file transport, with long-distance speed improvements of 3x to 10x over the usual TCP-based file transport."

-- Don Gilbert, August 2010, bio-mirror.net
  
Moving 113 GB of Bio-mirror Data

Site      RTT (ms)   TCP (min)   UDT (min)   TCP/UDT   Km
NCSA      10         139         139         1         200
Purdue    17         125         125         1         500
ORNL      25         361         120         3         1,200
TACC      37         616         120         5.1       2,000
SDSC      65         750         475         1.6       3,300
CSTNET    274        3722        304         12        12,000

GridFTP TCP and UDT transfer times for 113 GB from gridip.bio-mirror.net/biomirror/blast/ (Indiana, USA). All TCP and UDT times in minutes. Source: http://gridip.bio-mirror.net/biomirror/
  
Case Study: CGI 60 Genomes

- Trace by Complete Genomics showing the performance of moving 60 complete human genomes from Mountain View to Chicago using the open source Sector/UDT.
- Approximately 18 TB at about 0.5 Gbps on a 1G link.

Source: Complete Genomics.
  	
  	
  
Resource Use

Protocol        CPU Usage*     Memory*
GridFTP (UDT)   1.0% - 3.0%    40 MB
GridFTP (TCP)   0.1% - 0.6%    6 MB

*CPU and memory usage collected by Don Gilbert. He reports that rsync uses more CPU than GridFTP with UDT. Source: http://gridip.bio-mirror.net/biomirror/.
  
Sector/Sphere

- Sector/Sphere is a platform for data intensive computing built over UDT and designed to support geographically distributed clusters.
  
Questions?

For the most current version of these notes, see rgrossman.com
  
