Managing Big Data (Chapter 2, SC 11 Tutorial)


This is Chapter 2 of a tutorial that I gave at SC 11 on November 14, 2011.


  1. An Introduction to Data Intensive Computing
     Chapter 2: Data Management
     Robert Grossman, University of Chicago and Open Data Group
     Collin Bennett, Open Data Group
     November 14, 2011
  2. 1. Introduction (0830-0900)
        a. Data clouds (e.g. Hadoop)
        b. Utility clouds (e.g. Amazon)
     2. Managing Big Data (0900-0945)
        a. Databases
        b. Distributed file systems (e.g. Hadoop)
        c. NoSQL databases (e.g. HBase)
     3. Processing Big Data (0945-1000 and 1030-1100)
        a. Multiple virtual machines & message queues
        b. MapReduce
        c. Streams over distributed file systems
     4. Lab using Amazon's Elastic MapReduce (1100-1200)
  3. What Are the Choices?
     • Applications (R, SAS, Excel, etc.)
     • Databases (SQL Server, Oracle, DB2)
     • NoSQL databases (HBase, Accumulo, Cassandra, SimpleDB, …)
     • File systems
     • Clustered file systems (GlusterFS, …)
     • Distributed file systems (Hadoop, Sector)
  4. What Is the Fundamental Trade-Off? Scale up vs. scale out.
  5. Section 2.1: Databases
  6. Advice From Jim Gray
     1. Analyzing big data requires scale-out solutions, not scale-up solutions (GrayWulf).
     2. Move the analysis to the data.
     3. Work with scientists to find the most common "20 queries" and make them fast.
     4. Go from "working to working."
  7. Pattern 1: Put the metadata in a database and point to files in a file system.
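This pattern can be sketched in a few lines. SQLite stands in here for whatever relational database holds the catalog, and the table layout and file paths are purely illustrative:

```python
import sqlite3

# Hypothetical catalog: small metadata lives in a relational table,
# while the large data files stay on an ordinary file system.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE images (
        id   INTEGER PRIMARY KEY,
        ra   REAL,      -- sky coordinates (metadata)
        dec  REAL,
        band TEXT,      -- filter band (metadata)
        path TEXT       -- pointer to the file, not the pixels
    )""")
conn.execute(
    "INSERT INTO images (ra, dec, band, path) VALUES (?, ?, ?, ?)",
    (180.5, -1.2, "r", "/data/run42/frame-r-000042.fits"))

# Queries run against the metadata table; only the matching
# files are then opened from the file system.
row = conn.execute(
    "SELECT path FROM images WHERE band = 'r' AND ra BETWEEN 180 AND 181"
).fetchone()
print(row[0])  # -> /data/run42/frame-r-000042.fits
```

The database stays small enough to index and query quickly, which is what makes the "20 queries" advice workable even when the files themselves total terabytes.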
  8. Example: Sloan Digital Sky Survey
     • Two surveys in one
       – Photometric survey in 5 bands
       – Spectroscopic redshift survey
     • Data is public
       – 40 TB of raw data
       – 5 TB of processed catalogs
       – 2.5 terapixels of images
     • Catalog uses Microsoft SQL Server
     • Started in 1992, finished in 2008
     • JHU SkyServer serves millions of queries
  9. Example: Bionimbus Genomics Cloud
  10. Bionimbus architecture:
      • GWT-based front end
      • Utility cloud services
      • Database services
      • Analysis pipeline & re-analysis services
      • Intercloud services
      • Data ingestion services
      • Data cloud services
  11. The same architecture, with the implementation choices:
      • GWT-based front end
      • Elastic cloud services (Eucalyptus, OpenStack)
      • Database services (PostgreSQL)
      • Analysis pipeline & re-analysis services
      • Intercloud services
      • ID service
      • Data ingestion services (UDT, replication)
      • Large data cloud services (Hadoop, Sector/Sphere)
  12. Section 2.2: Distributed File Systems (Sector/Sphere)
  13. Hadoop's Large Data Cloud (Hadoop's stack)
      • Applications
      • Compute services: Hadoop's MapReduce
      • Data services: NoSQL databases
      • Storage services: Hadoop Distributed File System (HDFS)
  14. Pattern 2: Put the data into a distributed file system.
  15. Hadoop Design
      • Designed to run over commodity components that fail.
      • Data replicated, typically three times.
      • Block-based storage.
      • Optimized for efficient scans with high throughput, not low-latency access.
      • Designed for write once, read many.
      • Append operation planned for the future.
  16. Hadoop Distributed File System (HDFS) Architecture
      • HDFS is block-based.
      • Written in Java.
      • A Name Node handles control; clients exchange data directly with Data Nodes, which are organized into racks.
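The three-way replication above is usually placed rack-aware. A toy version of the commonly described heuristic (one replica on one rack, two more on a single remote rack) looks like this; real HDFS placement also weighs the writer's location, node load, and node health:

```python
import random

def place_replicas(block_id, racks, n=3):
    """Toy rack-aware placement for one block (sketch, not HDFS code).

    Heuristic: one replica on a "local" rack, two on one remote rack,
    so a whole-rack failure cannot lose all copies.
    """
    rack_names = list(racks)
    local_rack = random.choice(rack_names)
    remote_rack = random.choice([r for r in rack_names if r != local_rack])
    replicas = [random.choice(racks[local_rack])]
    replicas += random.sample(racks[remote_rack], 2)
    return replicas

racks = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
    "rack3": ["dn7", "dn8", "dn9"],
}
print(place_replicas("blk_0001", racks))  # e.g. ['dn2', 'dn7', 'dn9']
```

Every placement spans exactly two racks: enough redundancy to survive a rack failure, without paying cross-rack network cost for all three copies.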
  17. Sector Distributed File System (SDFS) Architecture
      • Broadly similar to the Google File System and the Hadoop Distributed File System.
      • Uses the native file system; it is not block-based.
      • Has a security server that provides authorizations.
      • Has multiple master name servers, so there is no single point of failure.
      • Uses UDT to support wide area operations.
  18. Sector Distributed File System (SDFS) Architecture
      • SDFS is file-based.
      • Written in C++.
      • Security server.
      • Multiple masters.
      Master Nodes and a Security Server handle control; clients exchange data directly with Slave Nodes, which are organized into racks.
  19. GlusterFS Architecture
      • No metadata server.
      • No single point of failure.
      • Uses algorithms to determine the location of data.
      • Can scale out by adding more bricks.
  20. GlusterFS Architecture
      • File-based.
      • Clients talk to a GlusterFS server; data lives on bricks, which are organized into racks.
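"Uses algorithms to determine the location of data" means placement is computed, not looked up. A minimal sketch of the idea: every client hashes the file path and derives the brick deterministically, so no metadata server is needed. (GlusterFS's actual elastic hashing assigns hash ranges to bricks; the modulo scheme below is a simplification.)

```python
import hashlib

def brick_for(path, bricks):
    """Metadata-free placement sketch: hash the path, pick a brick.

    Any client computes the same answer without consulting a server,
    which is how a design like this avoids a single point of failure.
    """
    h = int(hashlib.md5(path.encode()).hexdigest(), 16)
    return bricks[h % len(bricks)]

bricks = ["brick1", "brick2", "brick3", "brick4"]
print(brick_for("/videos/a.mp4", bricks))
```

The trade-off is rebalancing: with plain modulo hashing, adding a brick remaps most files, which is why production systems use hash ranges or consistent hashing instead.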
  21. Section 2.3: NoSQL Databases
  22. Evolution
      • Standard architecture for simple web applications:
        – Presentation: front-end, load-balanced web servers
        – Business logic layer
        – Backend database
      • The database layer does not scale with large numbers of users or large amounts of data.
      • Alternatives arose:
        – Sharded (partitioned) databases or master-slave databases
        – memcache
  23. Scaling an RDBMS
      • Master-slave database systems
        – Writes go to the master.
        – Reads go to the slaves.
        – Writing to the slaves can be a bottleneck, and reads can be inconsistent.
      • Sharded databases
        – Applications and queries must understand the sharding schema.
        – Both reads and writes scale.
        – No native, direct support for joins across shards.
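The two approaches above can be combined in one routing layer. The sketch below (all server names are illustrative) hash-shards keys across servers and splits reads from writes inside each shard; note how the routing logic lives in the application, which is exactly the "queries must understand the sharding schema" cost:

```python
import hashlib

class ShardedDB:
    """Toy router: hash-sharding plus a master-slave split per shard."""

    def __init__(self, shards):
        self.shards = shards  # list of {"master": ..., "slaves": [...]}

    def shard_for(self, key):
        # The application must know this schema to find its data.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.shards[h % len(self.shards)]

    def route_write(self, key):
        return self.shard_for(key)["master"]      # all writes to the master

    def route_read(self, key, n=0):
        slaves = self.shard_for(key)["slaves"]
        return slaves[n % len(slaves)]            # spread reads over replicas

db = ShardedDB([
    {"master": "db1-m", "slaves": ["db1-s1", "db1-s2"]},
    {"master": "db2-m", "slaves": ["db2-s1", "db2-s2"]},
])
print(db.route_write("user:alice"), db.route_read("user:alice"))
```

Two keys that hash to different shards land on different servers, so a join across them cannot be pushed down to the database; it has to be reassembled in the application.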
  24. NoSQL Systems
      • The name suggests no SQL support, but is also read as "Not Only SQL."
      • One or more of the ACID properties not supported.
      • Joins generally not supported.
      • Usually flexible schemas.
      • Some well-known examples: Google's BigTable, Amazon's Dynamo & Facebook's Cassandra.
      • Quite a few recent open source systems.
  25. Pattern 3: Put the data into a NoSQL application.
  26. [Image-only slide]
  27. CAP: Choose Two
      • C: consistency per operation
      • A: availability
      • P: partition-resiliency
      • CA: available and consistent, unless there is a partition.
      • CP: always consistent, even in a partition, but a reachable replica may deny service without quorum (BigTable, HBase).
      • AP: a reachable replica provides service even in a partition, but may be inconsistent (Dynamo, Cassandra).
  28. CAP Theorem
      • Proposed by Eric Brewer, 2000.
      • Three properties of a system: consistency, availability and partitions.
      • You can have at most two of these three properties for any shared-data system.
      • Scale out requires partitions.
      • Most large web-based systems choose availability over consistency.
      Reference: Brewer, PODC 2000; Gilbert/Lynch, SIGACT News 2002
  29. Eventual Consistency
      • If no updates occur for a while, all updates eventually propagate through the system and all the nodes become consistent.
      • Eventually, a node is either updated or removed from service.
      • Can be implemented with a gossip protocol.
      • Amazon's Dynamo popularized this approach.
      • Sometimes this is called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID.
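A toy model of gossip-based convergence: each replica holds key -> (timestamp, value), and a gossip exchange merges two replicas by last-write-wins. Real systems such as Dynamo use vector clocks and anti-entropy with Merkle trees; this sketch shows only the core idea:

```python
def gossip(a, b):
    """Merge two replicas' state in place, keeping the newest write."""
    for key in set(a) | set(b):
        winner = max(a.get(key, (0, None)), b.get(key, (0, None)))
        a[key] = b[key] = winner

# Three replicas in different states: n1 has a stale value for "x",
# n2 has the newest writes, n3 has seen nothing yet.
n1 = {"x": (1, "old")}
n2 = {"x": (2, "new"), "y": (1, "b")}
n3 = {}

# A few pairwise exchanges are enough for this tiny cluster to converge.
gossip(n1, n2)
gossip(n2, n3)
gossip(n1, n3)
assert n1 == n2 == n3   # all replicas agree once updates stop
print(n1["x"])          # -> (2, 'new')
```

This is exactly the slide's claim in miniature: while writes are flowing the replicas may disagree, but once updates pause, every exchange moves the system toward a single state.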
  30. Different Types of NoSQL Systems
      • Distributed key-value systems
        – Amazon's S3 key-value store (Dynamo)
        – Voldemort
        – Cassandra
      • Column-based systems
        – BigTable
        – HBase
        – Cassandra
      • Document-based systems
        – CouchDB
  31. HBase Architecture
      Clients (Java client, REST API) talk to the HBaseMaster, which coordinates HRegionServers; each HRegionServer manages data on its disks.
      Source: Raghu Ramakrishnan
  32. HRegionServer
      • Records partitioned by column family into HStores.
        – Each HStore contains many MapFiles.
      • All writes to an HStore are applied to a single memcache.
      • Reads consult the MapFiles and the memcache.
      • Memcaches are flushed as MapFiles (HDFS files) when full.
      • Compactions limit the number of MapFiles.
      Source: Raghu Ramakrishnan
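The write path above can be imitated in a few lines: writes land in an in-memory buffer, which is flushed to immutable sorted files when full, and reads consult both, newest first. The real HBase also writes a write-ahead log and compacts files; both are omitted in this sketch:

```python
class HStore:
    """Toy HStore: a memcache plus flushed, immutable MapFiles."""

    def __init__(self, flush_limit=3):
        self.memcache = {}
        self.mapfiles = []           # each flush appends one snapshot
        self.flush_limit = flush_limit

    def put(self, key, value):
        self.memcache[key] = value
        if len(self.memcache) >= self.flush_limit:
            # Flush: write the memcache out as a sorted, immutable file.
            self.mapfiles.append(dict(sorted(self.memcache.items())))
            self.memcache = {}

    def get(self, key):
        if key in self.memcache:     # newest data wins
            return self.memcache[key]
        for mf in reversed(self.mapfiles):
            if key in mf:
                return mf[key]
        return None

store = HStore()
for i in range(4):
    store.put(f"row{i}", f"v{i}")
print(len(store.mapfiles), store.get("row0"), store.get("row3"))  # -> 1 v0 v3
```

Reads get slower as flushed files pile up (each lookup may touch every file), which is why compactions, the last bullet above, periodically merge MapFiles back into a few.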
  33. Facebook's Cassandra
      • Modeled after BigTable's data model.
      • Modeled after Dynamo's eventual consistency.
      • Peer-to-peer storage architecture using consistent hashing (Chord hashing).
  34. Databases vs. NoSQL systems

      Aspect            | Databases                                | NoSQL systems
      ------------------|------------------------------------------|----------------------------------------------
      Scalability       | 100s of TB                               | 100s of PB
      Functionality     | Full SQL-based queries, including joins  | Optimized access to sorted tables (tables with single keys)
      Optimized for     | Safe writes                              | Efficient reads
      Consistency model | ACID (Atomicity, Consistency, Isolation & Durability): the database is always consistent | Eventual consistency: updates eventually propagate through the system
      Parallelism       | Difficult because of the ACID model; shared-nothing is possible | Basic design incorporates parallelism over commodity components
      Scale             | Racks                                    | Data center
  35. Section 2.3 Case Study: Project Matsu
  36. Zoom Levels / Bounds
      • Zoom level 1: 4 images
      • Zoom level 2: 16 images
      • Zoom level 3: 64 images
      • Zoom level 4: 256 images
      Source: Andrew Levine
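The counts above follow from quadtree subdivision: each zoom step splits every tile into four, so level z holds 4**z tiles. A small sketch, assuming a generic rectangular world extent (the actual Matsu tiling scheme may differ):

```python
def tiles_at(zoom):
    """Number of tiles at a zoom level: 2**zoom columns x 2**zoom rows."""
    return 4 ** zoom

def tile_bbox(zoom, col, row, world=(-180.0, -90.0, 180.0, 90.0)):
    """Bounding box of tile (col, row) at a zoom level (sketch)."""
    minx, miny, maxx, maxy = world
    w = (maxx - minx) / 2 ** zoom   # tile width halves each level
    h = (maxy - miny) / 2 ** zoom
    return (minx + col * w, miny + row * h,
            minx + (col + 1) * w, miny + (row + 1) * h)

print([tiles_at(z) for z in range(1, 5)])   # -> [4, 16, 64, 256]
print(tile_bbox(1, 0, 0))                   # -> (-180.0, -90.0, 0.0, 0.0)
```

Because each tile's bounding box is computable from (zoom, col, row), the bounding box itself can serve as the key for a tile, which is what the mapper and reducer on the next slides exploit.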
  37. Build Tile Cache in the Cloud: Mapper
      • Step 1, input to mapper. Input key: a bounding box (e.g. minx = -135.0, miny = 45.0, maxx = -112.5, maxy = 67.5); input value: the image.
      • Step 2, processing in the mapper. The mapper resizes and/or cuts up the original image into pieces.
      • Step 3, mapper output. One (bounding box, image piece) pair per output tile.
      Source: Andrew Levine
  38. Build Tile Cache in the Cloud: Reducer
      • Step 1, input to reducer. Input key: a bounding box (e.g. minx = -45.0, miny = -2.8125, maxx = -43.59375, maxy = -2.109375); input values: the image pieces for that box. The reducer assembles the images based on the bounding box.
      • Step 2, reducer output. Output goes to HBase, building up layers for WMS for various datasets.
      Source: Andrew Levine
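The mapper/reducer flow on these two slides can be imitated in-process. In this sketch the mapper cuts a source "image" into quadrant tiles keyed by bounding box, a hand-rolled shuffle groups by key, and the reducer assembles the pieces for each box; real code would run under Hadoop and manipulate pixels, so strings stand in for image data here:

```python
from collections import defaultdict

def mapper(bbox, image):
    """Emit one (bounding box, piece) pair per quadrant of the input."""
    minx, miny, maxx, maxy = bbox
    midx, midy = (minx + maxx) / 2, (miny + maxy) / 2
    quads = [(minx, miny, midx, midy), (midx, miny, maxx, midy),
             (minx, midy, midx, maxy), (midx, midy, maxx, maxy)]
    for q in quads:
        yield q, f"piece-of-{image}"

def reducer(bbox, pieces):
    """Assemble all pieces that share a bounding box into one tile."""
    return bbox, "assembled(" + ",".join(pieces) + ")"

# Shuffle: group mapper output by key, as the framework would.
groups = defaultdict(list)
for key, val in mapper((-135.0, 45.0, -112.5, 67.5), "scene1"):
    groups[key].append(val)

results = [reducer(k, v) for k, v in groups.items()]
print(len(results))  # -> 4, one assembled tile per quadrant
```

The key design point survives the simplification: because the bounding box is the MapReduce key, pieces of the same tile, cut from different source images, automatically meet at the same reducer.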
  39. HBase Tables
      • An Open Geospatial Consortium (OGC) Web Mapping Service (WMS) query (layers, styles, projection, size) translates to the HBase schema:
        – Table name: WMS layer
        – Row ID: bounding box of the image
        – Column family: style name and projection
        – Column qualifier: width x height
        – Value: buffered image
  40. Section 2.4: Distributed Key-Value Stores (S3)
  41. Pattern 4: Put the data into a distributed key-value store.
  42. S3 Buckets
      • S3 bucket names must be unique across AWS.
      • A good practice is to use a pattern like … for a domain you own.
      • The file is then referenced as …
      • If you own …, you can create a DNS CNAME entry to access the file as …
  43. S3 Keys
      • Keys must be unique within a bucket.
      • Values can be as large as 5 TB (formerly 5 GB).
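Bucket naming matters for the CNAME trick above: a bucket can only double as a DNS hostname if its name is DNS-compatible. A toy validator, under my assumptions about the rules (roughly 3-63 characters; lowercase letters, digits, hyphens and dots; starting and ending with a letter or digit); consult current AWS documentation for the authoritative constraints:

```python
import re

def dns_compatible_bucket(name):
    """Sketch check that a bucket name can double as a DNS CNAME target.

    Assumed rules (verify against AWS docs): 3-63 chars, lowercase
    letters/digits/hyphens/dots, first and last char alphanumeric.
    """
    if not 3 <= len(name) <= 63:
        return False
    return re.fullmatch(r"[a-z0-9][a-z0-9.-]*[a-z0-9]", name) is not None

print(dns_compatible_bucket("data.example.com"))  # -> True
print(dns_compatible_bucket("MyBucket"))          # -> False (uppercase)
print(dns_compatible_bucket("ab"))                # -> False (too short)
```

Naming the bucket after a domain you own (e.g. a `data.example.com`-style name) is what makes the later CNAME entry possible, since the bucket name and the hostname must match.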
  44. S3 Security
      • AWS access key: functions as your S3 username. It is an alphanumeric text string that uniquely identifies users.
      • AWS secret key: functions as your password.
  45. AWS Account Information
  46. Access Keys (access key = user name, secret key = password)
  47. Other Amazon Data Services
      • Amazon Simple Database Service (SDS)
      • Amazon's Elastic Block Storage (EBS)
  48. Section 2.5: Moving Large Data Sets
  49. The Basic Problem
      • TCP was never designed to move large data sets over wide area high performance networks.
      • As a general rule, reading data off disks is slower than transporting it over the network.
  50. [Figure: TCP throughput (Mb/s, up to 1000) vs. round trip time (1-400 ms, spanning LAN, US, US-EU and US-Asia distances) for packet loss rates from 0.01% to 0.5%; throughput falls sharply as RTT and loss grow.]
      Source: Yunhong Gu, 2007, experiments over wide area 1G.
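The shape of those measurements is captured by the well-known Mathis et al. approximation for steady-state TCP throughput, rate ≈ (MSS / RTT) x (C / sqrt(loss)) with C ≈ 1.22. A quick calculation shows why a loss rate that is harmless on a LAN cripples a transoceanic link (this is the model, not the slide's measured data):

```python
from math import sqrt

def tcp_throughput_mbps(rtt_ms, loss, mss_bytes=1460):
    """Mathis et al. estimate of steady-state TCP throughput in Mb/s.

    rate ~ (MSS / RTT) * (C / sqrt(loss)), with C ~= 1.22 and a
    typical Ethernet MSS of 1460 bytes.
    """
    rtt_s = rtt_ms / 1000.0
    return (mss_bytes * 8 / rtt_s) * (1.22 / sqrt(loss)) / 1e6

# Same 0.01% loss rate, very different outcomes:
print(round(tcp_throughput_mbps(1, 0.0001), 1))    # LAN RTT    -> 1425.0
print(round(tcp_throughput_mbps(200, 0.0001), 1))  # US-Asia RTT -> 7.1
```

Throughput falls linearly with RTT and with the square root of loss, which is exactly the collapse the figure shows, and the motivation for the parallel-stream and UDT approaches on the next slide.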
  51. The Solution
      • Use parallel TCP streams (GridFTP).
      • Use specialized network protocols (UDT, FAST, etc.).
      • Use RAID to stripe data across disks to improve throughput when reading.
      • These techniques are well understood in HEP and astronomy, but not yet in biology.
  52. Case Study: Bio-mirror
      "[The open source GridFTP] from the Globus project has recently been improved to offer UDP-based file transport, with long-distance speed improvements of 3x to 10x over the usual TCP-based file transport."
      – Don Gilbert, August 2010, bio-…
  53. Moving 113 GB of Bio-mirror Data

      Site    | RTT (ms) | TCP (min) | UDT (min) | TCP/UDT | Km
      --------|----------|-----------|-----------|---------|-------
      NCSA    | 10       | 139       | 139       | 1       | 200
      Purdue  | 17       | 125       | 125       | 1       | 500
      ORNL    | 25       | 361       | 120       | 3       | 1,200
      TACC    | 37       | 616       | 120       | 5.1     | 2,000
      SDSC    | 65       | 750       | 475       | 1.6     | 3,300
      CSTNET  | 274      | 3,722     | 304       | 12      | 12,000

      GridFTP TCP and UDT transfer times, in minutes, for 113 GB from bio-… (Indiana, USA). Source: http://…bio-…
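Converting "minutes to move 113 GB" into an effective rate makes the table's pattern plain: TCP's rate collapses as RTT grows, while UDT's stays roughly flat. A quick conversion (using decimal units, 1 GB = 8,000 Mb):

```python
def effective_mbps(gb, minutes):
    """Effective transfer rate in Mb/s for gb gigabytes in the given minutes."""
    return gb * 8 * 1000 / (minutes * 60)

# (site, TCP minutes, UDT minutes) taken from the table above.
for site, tcp_min, udt_min in [("NCSA", 139, 139),
                               ("TACC", 616, 120),
                               ("CSTNET", 3722, 304)]:
    print(site, round(effective_mbps(113, tcp_min)),
          round(effective_mbps(113, udt_min)), "Mb/s")
# NCSA comes out near 108 Mb/s either way; at CSTNET's 274 ms RTT,
# TCP drops to a few Mb/s while UDT still sustains tens of Mb/s.
```

This matches the qualitative prediction of the loss/RTT model two slides back: at short RTT both protocols saturate the same bottleneck, and only at long RTT does the protocol choice dominate.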
  54. Case Study: CGI 60 Genomes
      • Trace by Complete Genomics showing the performance of moving 60 complete human genomes from Mountain View to Chicago using the open source Sector/UDT.
      • Approximately 18 TB at about 0.5 Mbs on a 1G link.
      Source: Complete Genomics
  55. Resource Use

      Protocol      | CPU usage*  | Memory*
      --------------|-------------|--------
      GridFTP (UDT) | 1.0% - 3.0% | 40 MB
      GridFTP (TCP) | 0.1% - 0.6% | 6 MB

      *CPU and memory usage collected by Don Gilbert. He reports that rsync uses more CPU than GridFTP with UDT. Source: http://…bio-…
  56. Sector/Sphere
      Sector/Sphere is a platform for data intensive computing built over UDT and designed to support geographically distributed clusters.
  57. Questions? For the most current version of these notes, see …