The road lies plain before me;--'tis a theme
Single and of determined bounds; …
- Wordsworth, The Prelude

EC4000–PhD Guest Seminar, Naval Postgraduate School
Krishna Sankar, http://doubleclix.wordpress.com
Nov 29, 2011
o  Agenda
   o  To cover the broad picture
   o  Understand the waypoints &
   o  Drill down into one area (NOSQL)
   o  Can do others later …
o  Of the Big Data domain …

[Diagram: waypoints What is Big Data ?, Big Data to smart data, Big Data Pipeline; pipeline areas Analytics/Modeling (R), Analytic Algorithms, Storage - NOSQL, Processing - Hadoop, Visualization, …]
Thanks to …
The giants whose shoulders I am standing on

Special Thanks to:
   o  Peter Ateshian, NPS
   o  Prof Murali Tummala, NPS
   o  Shirley Bailes, O’Reilly
   o  Ed Dumbill, O’Reilly
   o  Jeff Barr, AWS
   o  Jenny Kohr Chynoweth, AWS
When I think of my own native land,
In a moment I seem to be there;
But, alas! recollection at hand
Soon hurries me back to despair.
- Cowper, The Solitude Of Alexander Selkirk
What is Big Data ?
•  “Big data” is data that becomes large enough that it cannot be processed using conventional methods. @twitter
•  “Big data” is less about size, more about flow & velocity - persisting petabytes per year is easier than processing terabytes per hour. @twitter

Ref: http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html
What is Big Data ?

Vinod Khosla’s Cool Dozen!
•  Consumers : “Widespread innovation in technologies that reduce data overload for users” ~ Data Reduction
•  Businesses : “Simple solutions to handle the deluge of data generated from various sources …” ~ Big Data Analytics
•  TV 2.0, Education, Social NEXT, Tools for sharing interest, Publishing, …

Ref: http://www.ciol.com/News/News/News-Reports/Vinod-Khosla%E2%80%99s-cool-dozen-tech-innovations/156307/0/
http://yourstory.in/2011/11/vinod-khoslas-keynote-at-nasscom-product-conclave-reject-punditry-believe-in-an-idea-take-risk-and-succeed/
EBC322

•  Volume
   o  Scale
•  Velocity
   o  Data change rate vs. decision window
•  Variety
   o  Different sources & formats
   o  Structured vs. Unstructured
•  Variability
   o  Breadth of interpretation &
   o  Depth of analytics
•  Contextual
   o  Dynamic variability
   o  Recommendation
•  Connectedness

http://doubleclix.wordpress.com/2011/09/13/when-is-big-data-really-big-data/
http://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf
I.    Two Main Types – based on collection
       i.   Big Data Streams
              o  Data in “motion”
              o  Twitter fire hose, Facebook, G+
       ii.  Big Data Logs
              o  Data “at rest”
              o  Logs, DW, external market data, POS, …
II.   Typically, Big Data has a non-deterministic angle as well …
       o  Creative Discovery
       o  Iterative, Model based Analytics
       o  Explore questions to ask
III.  Smart Data = Big Data + context + embedded/interactive (inference, reasoning) models
       o  Model Driven
       o  Declaratively Interactive

http://www.slideshare.net/leonsp/hadoop-slides-11-what-is-big-data
http://www.slideshare.net/Dataversity/wed-1550-bacvanskivladimircolor
AWS – 600 Billion objects!

Twitter
  §  200 million tweets/day
  §  Peak 10,000/second
  §  How would you handle the fire hose for social network analytics ?

Zynga
  §  “Analytics company, not a gaming company!”
  §  Harvests data : 15 TB/day
  §  Test new features
  §  Target advertising
  §  230 million players/month

Storage
  §  4 U box = 40 TB, 1 PB = 25 boxes !

http://goo.gl/dcBsQ
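
The figures on this slide can be sanity-checked with a little arithmetic. The following is only a back-of-the-envelope sketch: it assumes decimal units (1 PB = 1000 TB) and tweets spread evenly over the day, and the variable names are illustrative, not from the talk.

```python
import math

TWEETS_PER_DAY = 200_000_000   # from the slide
PEAK_TWEETS_PER_SEC = 10_000   # from the slide
TB_PER_BOX = 40                # one 4U box, from the slide
TB_PER_PB = 1000               # assumption: decimal units

# Average rate if tweets were spread evenly over 24 hours (an assumption).
avg_tweets_per_sec = TWEETS_PER_DAY / (24 * 60 * 60)
peak_to_avg = PEAK_TWEETS_PER_SEC / avg_tweets_per_sec
boxes_per_pb = math.ceil(TB_PER_PB / TB_PER_BOX)

print(f"average tweets/second: {avg_tweets_per_sec:.0f}")
print(f"peak is {peak_to_avg:.1f}x the average")
print(f"boxes needed for 1 PB: {boxes_per_pb}")
```

The gap between the average and the peak rate is the "velocity" problem: capacity has to be sized for the burst, not for the mean.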
  
•  6 Billion Messages per day
•  2 PB (w/compression) online
•  6 PB w/ replication
•  250 TB/Month growth
•  HBase Infrastructure
•  50 TB/Day
•  240 nodes, 84 PB
•  Teradata Installation
•  Path Analysis
•  A/B Testing
•  Very systematic
•  Diagram speaks volumes!

Ref: http://www.hpts.ws/sessions/2011HPTS-TomFastner.pdf
•  “… they didn’t need a genius, … but build the world’s most impressive dilettante … battling the efficient human mind with spectacular flamboyant inefficiency” – Final Jeopardy by Stephen Baker
   •  15 TB memory, across 90 IBM 760 servers, in 10 racks
   •  1 TB of dataset
   •  200 Million pages processed by Hadoop
   •  This is a good example of Connected data
          –  Contextual w/ variability
          –  Breadth of interpretation
          –  Analytics depth

http://doubleclix.wordpress.com/2011/03/01/the-education-of-a-machine-%E2%80%93-review-of-book-%E2%80%9Cfinal-jeopardy%E2%80%9D-by-stephen-baker/
http://doubleclix.wordpress.com/2011/02/17/watson-at-jeopardy-a-race-of-machines/
[Mind map centered on “Big Data”. Nodes: Storage (Block Store, Object Store, NOSQL); Parallelism (Map/Reduce, HPC); Cloud Architecture; Analytics (Warehouse-style Applications, Distributed Applications, Web Analytics, Log Analytics); Social Media, Social Graph; Inference, Recommendation/Inference Engines; Machine Learning (Mahout; Classification, Clustering); Knowledge Graph; Search, Indexing]
“A towel is about the most massively useful thing an
     interstellar hitchhiker can have … any man who can
     hitch the length and breadth of the Galaxy, rough it …
     win through, and still know where his towel is, is clearly
     a man to be reckoned with.” 
                     - From The Hitchhiker's Guide to the Galaxy, by Douglas Adams. 
                                                 Published by Harmony Books in 1979




Big Data to Smart Data
1  Don’t throw away any data !
2  Be ready for different ways of organizing the data
•  summary

http://goo.gl/fGw7r
Big  Data  Pipeline	


     If a problem has no solution, it is not a problem,
     but a fact, not to be solved but to be coped with,
     over time …
                                             - Peres’s Law
Big  Data  Pipeline	
•  Stages
   o    Collect
   o    Store
   o    Transform & Analyze
   o    Model & Reason
   o    Predict, Recommend & Visualize
•  Different systems have different characteristics
   o  Infrastructure optimization based on application/hardware
      attributes correlation (short term)
        •  Hadoop, Splunk, internal Dashboard
   o  Application performance trends (medium term)
        •  Analytics, Modeling,…
   o  Product Metrics
        •  Feature set vs. usage, what is important to users, stratification
        •  Modeling using R, Visualization layers like Tableau
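
The stages listed above can be sketched as a chain of small functions. This is a toy illustration only: the function names, the in-memory "log lines", and the frequency-count "model" are all assumptions made for the example, standing in for real collectors, stores, and modeling tools.

```python
from collections import Counter

def collect():
    # Stand-in for a log or stream source; the records are made up.
    return ["u1 play", "u2 play", "u1 buy", "u3 play", "bad-line"]

def store(lines):
    # Keep everything, including malformed records.
    return list(lines)

def transform(raw):
    # Parse records of the form "<user> <action>"; skip what does not parse.
    events = []
    for line in raw:
        parts = line.split()
        if len(parts) == 2:
            events.append({"user": parts[0], "action": parts[1]})
    return events

def model(events):
    # Trivial "model": how often each action occurs.
    return Counter(event["action"] for event in events)

def predict(counts):
    # Naive prediction: the most frequent action so far.
    return counts.most_common(1)[0][0]

raw = store(collect())
counts = model(transform(raw))
print(counts, predict(counts))
```

Note that the store stage keeps the malformed record; only the transform stage decides what is usable, so a later, smarter parser can still go back to the raw data.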
Big Data Pipeline
Ref: http://goo.gl/Mm83k

[Diagram: pipeline tools plotted against data characteristics]
•  Characteristics, bottom to top: Volume, Velocity, Variability, Variety, Connectedness, Context, Model, Infer-ability
•  Tools, left to right:
   o  Logs, Scribe, Flume, Hadoop, …
   o  SQL, NOSQL, HDFS, XML, files, …
   o  SQL, BI Tools, Hadoop, Pig, Hive, .NET Dryad, various other tools
   o  Hand coded programs, R, Mahout, …
   o  Internal dashboards, Tableau
•  Decomplexify!   Contextualize!   Network!   Reason!   Infer!
Build to Fail - “It is working” is not binary	





The  NOSQL  !	


I AM monarch of all I survey;
My right there is none to dispute;
From the centre all round to the sea
I am lord of the fowl and the brute
- Cowper, The Solitude Of Alexander Selkirk
Agenda
•  Opening Gambit
      –  NOSQL : Toil, Tears & Sweat !
•  The Pragmas
      –  ABCs of NOSQL [ACID, BASE & CAP]
•  The Mechanics
      –  Algorithmics & Mechanisms (For reference)




Referenced Links @ http://doubleclix.wordpress.com/2010/06/20/nosql-talk-references/
What is NOSQL Anyway ?
•  NOSQL != NoSQL or NOSQL != (!SQL)
•  NOSQL = Not Only SQL
•  Can be traced back to Eric Evans[2]!
      –  You can ask him during the afternoon session!
•  Unfortunate Name, but is stuck now
•  Non Relational could have been better
•  Usually Operational, Definitely Distributed
•  NOSQL has certain semantics – need not stay that way
NOSQL
•  Key Value
      –  In-memory: Memcached
      –  Disk Based: Redis, Tokyo Cabinet, Dynamo, Voldemort
•  Column
      –  SimpleDB, Google BigTable, HBase, Cassandra, HyperTable, Azure TS
•  Document
      –  CouchDB, MongoDB, Lotus Domino, Riak
•  Graph
      –  Neo4j, FlockDB, InfiniteGraph

Ref: [22,51,52]
When I think of my own native land,
                             In a moment I seem to be there;
                                But, alas! recollection at hand
                            Soon hurries me back to despair.
                 - Cowper, The Solitude Of Alexander Selkirk




NOSQL Tales from the field
WHAT WORKS
•  Designer Augmenting RDBMS with a Distributed key
   Value Store[40 : A good talk by Geir]
•  Invitation only designer brand sales
•  Limited inventory sales – start at 12:00, members have
   10 min to grab them. 500K mails every day
•  Keeps brand value, hidden from search
•  Interesting load properties
•  Each item a row in DB-BUY NOW reserves it
   –  Can't order more
•  Started out as a Rails app
   –  shared nothing
•  Narrow peaks – half of revenue
Christian Louboutin Effect


•  ½ amz for Louboutin
•  Use Voldemort
•  Inventory, Shopping Cart,
   Checkout
•  Partition by prod ID
•  Shared infrastructure – “fog”
   not “cloud” - Joyent!
•  In-memory inventory
•  Not afraid of sale anymore!
           And SQL DBs are
           still relevant !
Typical NOSQL Example Bit.ly
•  Bit.ly URL shortening service, uses MongoDB
•  User, title, URL, hash, labels[I-5], sort by time
•  Scale – ~50M users, ~10K concurrent, ~1.25B shortens
   per month
•  Criteria:
   –  Simple, Zippy FAST, Very Flexible, Reasonable Durability, Low
      cost of ownership
•  Sharded by userid
•  New kind of “dictionary” a word repository, GPS for
   English – context, pronunciations, twitter … developer
   API
•  Characteristics[I-6,Tony Tam’s presentation]
   –  RO-centric, 10,000 reads for every write
   –  Hit a wall with MySQL (4B rows)
   –  MongoDB read was so good that memcached layer was not
      required
   –  MongoDB used 4 times MySQL storage
•  Another example :
   –  Voldemort – Unified Communications, IP-Phone data stored
      keyed off of phone number. Data relatively stable
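
Partitioning by product ID, sharding by user ID, and keying records off a phone number all rest on the same idea: a stable function from key to node. The following is a minimal sketch under stated assumptions; the shard count, the `shard_for`, `put` and `get` helpers, and the sample record are hypothetical and are not how any of the systems named above implement it.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key (e.g. a user id or phone number) to a shard index."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

NUM_SHARDS = 4  # hypothetical cluster size
shards = {i: {} for i in range(NUM_SHARDS)}  # each dict stands in for a node

def put(user_id, record):
    shards[shard_for(user_id, NUM_SHARDS)][user_id] = record

def get(user_id):
    return shards[shard_for(user_id, NUM_SHARDS)].get(user_id)

put("user-42", {"url": "http://example.com", "hash": "abc123"})
print(get("user-42"))
```

Plain modulo hashing remaps most keys whenever the shard count changes, which is why Dynamo-style stores use consistent hashing instead; the sketch keeps modulo only for brevity.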
Large Hadron Collider@CERN
•  DAS is part of giant data management
   enterprise (CMS)
      –  Polyglot Persistence (SQL + NOSQL, Mongo, Couch,
         memcache, HDFS, Lustre, Oracle, MySQL, …)
•    Data Aggregation System [I-1,I-2,I-3,I-4]
      –  Uses MongoDB
      –  Distributed Model, 2-6 PB data
      –  Combine info. from different metadata sources, query
         without knowing their existence, user has domain
         knowledge – but shouldn’t deal with various formats,
         interfaces and query semantics
      –  DAS aggregates, caches and presents data as JSON
         documents – preserving security & integrity




                                            And SQL DBs are
                                            still relevant !
Scaling Twitter
•  Digg
   –  RDBMS places more burden on reads than on writes[I-8]
   –  Looked at NOSQL, selected Cassandra
       •  Column oriented, so more structure than key-value
•  Heard from noSQL Boston[http://twitter.com/#search?q=%23nosqllive]
   –  Baidu: 120 node HyperTable cluster managing
      600TB of data
   –  StumbleUpon uses HBase for Analytics
   –  Twitter’s Current Cassandra cluster: 45 nodes
•  Adobe is a HBase shop [I-10,I-11,2]
•  Adobe SaaS Infrastructure – tagging, content aggregation, search, storage and so forth
•  Dynamic schema & huge number of records[I-5]
•  40 million records in 2008 to 1 billion with 50 ms response
•  NOSQL not mature in 2008, now good enough
•  Prod Analytics: 40 nodes, largest has 100 nodes

•  BBC is a CouchDB shop [I-13]
•  Sweet spot:
     •  Multi-master, multi datacenter replication
•  Interactive Mediums
     •  Old data to CouchDB
     •  Thus free up DB to do work!
•  Cloudkick is a Cassandra shop[I-12]
•  Cloudkick offers cloud management services
•  Store metrics data
•  Linear scalability for write load
•  Massive write performance
    •  Memory table & serial commit log
•  Low operational costs
•  Data Structure
     –  Metrics, Rolled-up data, Statuses at time slice : all indexed by
        timestamp
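
The "memory table & serial commit log" bullet is what makes writes cheap: every write is a sequential append plus an in-memory update, with no read-before-write. Below is a minimal sketch of that idea with metrics indexed by timestamp. The class name, the JSON log format, and the sample metrics are assumptions for illustration; it omits flushing to disk tables, fsync, and everything else a real store such as Cassandra does.

```python
import bisect
import json
import os
import tempfile

class TinyWriteStore:
    """Append to a serial commit log, then update a sorted in-memory table."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.keys = []      # timestamps, kept sorted
        self.values = {}    # timestamp -> metric
        if os.path.exists(log_path):
            self._replay()

    def _apply(self, ts, metric):
        if ts not in self.values:
            bisect.insort(self.keys, ts)
        self.values[ts] = metric

    def _replay(self):
        # Rebuild the in-memory table from the log after a restart.
        with open(self.log_path) as log:
            for line in log:
                ts, metric = json.loads(line)
                self._apply(ts, metric)

    def write(self, ts, metric):
        # Sequential append only: no seek, no read-before-write.
        with open(self.log_path, "a") as log:
            log.write(json.dumps([ts, metric]) + "\n")
        self._apply(ts, metric)

    def range(self, start, end):
        # Time-slice query over the sorted timestamps (inclusive bounds).
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_right(self.keys, end)
        return [(ts, self.values[ts]) for ts in self.keys[lo:hi]]

log_file = os.path.join(tempfile.mkdtemp(), "commit.log")
store = TinyWriteStore(log_file)
for ts, cpu in [(1003, 0.7), (1001, 0.2), (1002, 0.5)]:
    store.write(ts, {"cpu": cpu})

recovered = TinyWriteStore(log_file)   # simulate a restart
print(recovered.range(1001, 1002))
```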
•  Guardian/UK
   –  Runs on Redis[I-14] !
   –  “Long-term The Guardian is looking towards the adoption of a schema-free database to sit alongside its Oracle database and is investigating CouchDB. … the relational database is now just a component in the overall data management story, alongside data caching, data stores, search engines etc.”
   –  NOSQL can increase performance of relational data by offloading specific data and tasks

And SQL DBs are still relevant !
"The evil that SQL DBs do lives after them; the good is oft interred with their bones..."
NOSQL at Netflix
•  Netflix is fully in the cloud
•  Uses NOSQL across the globe
•  Customer Profiles, watchlog, usage logging (see next
   slide)
     –  No multi-record locking
•    No DBA !
•    Easier Schema Changes
•    Less complex, Highly Available data store
•    Joins happen in the applications




                                  http://www.hpts.ws/sessions/nosql-ecosystem.pdf
                                  http://www.hpts.ws/sessions/GlobalNetflixHPTS.pdf
21 NOSQL Themes
•  Web Scale
•  Scale Incrementally/continuous growth
•  Oddly shaped & exponentially connected
•  Structure data as it will be used – i.e. read, query
•  Know your queries/updates in advance[96], but you can change them later
•  Compute attributes at run time
•  Create a few large entities with optional parts
      –  Normalization creates many small entities
•  Define Schemas in models (not in databases)
•  Avoid impedance mismatch
•  Narrow down & solve your core problem
•  Solve the right problem with the right tool

Ref: [I-8]
21 NOSQL Themes
•  Existing solutions are clunky[1] (in certain situations)
•  Scale automatically, “becoming prohibitively costly (in terms of manpower) to operate” Twitter[I-9]
     •  Distribution & partitioning are built into NOSQL
•  RDBMS distribution & sharding is not fun and is expensive
    –  Lose most functionality along the way
•  Data at the center, Flexible schema, Fewer joins
•  The value of NOSQL is in flexibility as much as it is in “Big Data”
21 NOSQL Themes
•  Requirements[3]
    –  Data will not fit in one node
          •  And so need data partition/distribution by the system
    –  Nodes will fail, but data needs to be safe – replication!
    –  Low latency for real-time use
•  Data Locality
    –  Row based structures will need to read whole row, even for a column
    –  Column based structures need to scan for each row
•  Solution : Column storage with Locality
    –  Keep together the data that is read together, don’t read what you don’t care about
          •  For example friends – other data

Ref: 3
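
The locality argument can be made concrete with a toy comparison. This is a sketch under stated assumptions: the user records, the "friends" and "profile" groupings, and the `approx_size` cost measure (length of the Python repr) are all invented for illustration and only approximate how much data each layout has to touch.

```python
# Three user records with a small, hot field (friends) and a large, cold one (bio).
rows = [
    {"id": 1, "friends": [2, 3], "bio": "x" * 1000},
    {"id": 2, "friends": [1], "bio": "y" * 1000},
    {"id": 3, "friends": [1], "bio": "z" * 1000},
]

# Row layout: each user's fields are stored together.
row_store = {r["id"]: r for r in rows}

# Column groups with locality: fields that are read together live together.
families = {
    "friends": {r["id"]: r["friends"] for r in rows},
    "profile": {r["id"]: r["bio"] for r in rows},
}

def approx_size(obj):
    # Crude stand-in for "bytes read from disk".
    return len(repr(obj))

def read_friends_from_rows(user_id):
    row = row_store[user_id]                  # must fetch the whole row
    return row["friends"], approx_size(row)

def read_friends_from_family(user_id):
    friends = families["friends"][user_id]    # touches only the friends group
    return friends, approx_size(friends)

friends_a, cost_row = read_friends_from_rows(1)
friends_b, cost_family = read_friends_from_family(1)
print(friends_a == friends_b, cost_row, cost_family)
```

Both reads return the same friends list; the difference is only in how much unrelated data comes along for the ride.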
  
ABCs of NOSQL - ACID, BASE & CAP
The woods are lovely, dark, and deep, 
          But I have promises to keep, 
       And miles to go before I sleep, 
       And miles to go before I sleep.
                               -Frost
CAP Principle
“CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37]

[Diagram: triangle with corners Consistency, Availability, Partition]

Which feature to discard depends on the nature of your system[41]
CAP Principle
•  C-A, No P → Single DB server, no network partition
CAP Principle
•  C-P, No A → Block transaction in case of partition failure
CAP Principle
•  A-P, No C → Expiration based caching, voting majority
•  Interesting (& controversial) from NOSQL perspective
ABCs of NOSQL
•  ACID
    o  Atomicity, Consistency, Isolation & Durability – fundamental properties of SQL DBMS
•  BASE[35,39]
    o  Basically Available, Soft state (Scalable), Eventually Consistent
•  CAP[36,39]
    o  Consistency, Availability & Partitioning
    o  This C is ~A+C
         •  i.e. Atomic Consistency[36]
ACID
•  Atomicity
    o  All or nothing
•  Consistent
    o  From one consistent state to another
          •  e.g. Referential Integrity
    o  But it is also application dependent
          •  e.g. min account balance
          •  Predicates, invariants, …
•  Isolation
•  Durability
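
Atomicity ("all or nothing") and an application-level consistency rule (a minimum account balance) can be seen together in a few lines. This is a minimal sketch using Python's built-in `sqlite3` module and an in-memory database; the `account` table, the names, and the `transfer` helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"  # the invariant
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically; True if committed, False if rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account"))

ok_small = transfer(conn, "alice", "bob", 30)    # fits within the balance
ok_large = transfer(conn, "alice", "bob", 500)   # would overdraw alice
print(ok_small, ok_large, balances(conn))
```

The credit is deliberately applied before the debit so that the failed transfer shows the rollback: the credit to bob is undone when the debit violates the balance check.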
  
CAP Pragmas
•  Preconditions
    o  The domain is scalable web apps
    o  Low Latency For real time use
    o  A small sub-set of SQL Functionality
    o  Horizontal Scaling
•  Pritchett[35] talks about relaxing consistency across functional groups rather than within functional groups
•  Idempotency to consider
    o  Updates inc/dec are rarely idempotent
    o  Order preserving trx are not idempotent either
    o  MVCC is an answer for this (CouchDB)
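
Why an increment is not idempotent, and how a revision check helps, can be sketched in a few lines. The `VersionedDoc` class below is a hypothetical illustration loosely modeled on revision-checked updates; it is not CouchDB's API, and the retry loop simply simulates a message being delivered twice.

```python
class VersionConflict(Exception):
    pass

class VersionedDoc:
    """A document that accepts an update only against its current revision."""

    def __init__(self, value):
        self.rev = 1
        self.value = value

    def update(self, expected_rev, new_value):
        if expected_rev != self.rev:
            raise VersionConflict(f"expected rev {expected_rev}, have {self.rev}")
        self.value = new_value
        self.rev += 1
        return self.rev

# A blind increment is not idempotent: a retried message double-counts.
counter = 10
for _ in range(2):          # original delivery plus one retry
    counter += 1

# A revision-checked write is safe to retry: the replay is rejected.
doc = VersionedDoc(10)
applied = 0
for _ in range(2):          # same update delivered twice, both built against rev 1
    try:
        doc.update(expected_rev=1, new_value=11)
        applied += 1
    except VersionConflict:
        pass

print(counter, doc.value, doc.rev, applied)
```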
  
Consistency
•  Strict Consistency
   o  Any read on Data X will return the most recent write on X[42]
•  Sequential Consistency
   o  Maintains sequential order from multiple processes (No mention of time)
•  Linearizability
   o  Add timestamp from loosely synchronized processes
Consistency
•  Write availability, not read availability[44]
•  Even load distribution is easier in eventually consistent systems
•  Multi-data center support is easier in eventually consistent systems
•  Some problems are not solvable with eventually consistent systems
•  Code is sometimes simpler to write in strongly consistent systems
CAP Essentials – 1 of 3
•  “CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37]
    o  C-A No P → Single DB server, no network partition
    o  C-P No A → Block transaction in case of partition failure
    o  A-P No C → Expiration based caching, voting majority
•  Which feature to discard depends on the nature of your system[41]
  
CAP Essentials – 2 of 3
•  Yield vs. Harvest[37]
    o  Yield → Probability of completing a request
    o  Harvest → Fraction of data reflected in the response
•  Some systems tolerate < 100% harvest (e.g. search, where approximate answers are OK); others need 100% harvest (e.g. transactions, where correct behavior = a single well-defined response)
•  For sub-systems that tolerate harvest degradation, CAP makes sense
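As a rough worked example of yield vs. harvest, the sketch below assumes a hypothetical service whose data is spread over index shards, some of which may be unreachable. The function names and the shard model are illustrative assumptions, not taken from [37].

```python
def request_yield(completed, offered):
    """Yield: fraction of offered requests that were completed."""
    return completed / offered


def response_harvest(shards_answering, shards_total):
    """Harvest: fraction of the data reflected in one response."""
    return shards_answering / shards_total


def serve(shards_answering, shards_total, needs_full_harvest):
    """Return the harvest of the answer, or None when the request is refused."""
    harvest = response_harvest(shards_answering, shards_total)
    if needs_full_harvest and harvest < 1.0:
        return None  # a transaction gives up yield rather than answer partially
    return harvest   # a search keeps its yield and accepts a lower harvest


# One shard out of ten is unreachable.
search_answer = serve(9, 10, needs_full_harvest=False)
trx_answer = serve(9, 10, needs_full_harvest=True)
```

With one shard down, the search still answers with 90% harvest, while the transaction-style request is refused, which lowers yield instead.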
  	
  	
  
CAP Essentials – 3 of 3
•  Trading harvest for yield – AP
•  Decompose the application & use NOSQL in appropriate sub-systems that have state management and data semantics that match the operational feature & impedance
    o    Hence Not Only SQL, not No SQL
    o    Intelligent homing to tolerate partition failures[44]
    o    Multi zones in a region (150 miles - 5 ms)
    o    Twitter tweets in Cassandra & MySQL
    o    BBC using MongoDB for offloading DBMS
    o    Polyglot persistence at LHC@CERN
  
   (Callout on this slide: Most important point in the whole presentation)
Eventual Consistency & AMZ
•  Distribution Transparency[38]
•  In larger distributed systems, network partitions are a given
•  Consistency Models
    o  Strong
    o  Weak
         •  Has an inconsistency window between an update and the guaranteed view
    o  Eventual
         •  If there are no new updates, all will see the value, eventually
  
Eventual Consistency & AMZ
•  Guarantee variations[38]
   o Read-your-writes
   o Session consistency
   o Monotonic read consistency
          •  Access will not return a previous value
   o Monotonic write consistency
          •  Serialize writes by the same process
•  Guarantee order (vector clocks, MVCC)
   o  Example: Amz cart merger (let cart adds succeed even with partial failure)
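The cart merger can be sketched with vector clocks as below. This is a simplified illustration: the set-union merge and all names are assumptions made for the example, not Amazon's actual implementation.

```python
def vc_increment(clock, node):
    """Return a copy of the clock with this node's counter advanced by one."""
    new_clock = dict(clock)
    new_clock[node] = new_clock.get(node, 0) + 1
    return new_clock


def vc_descends(a, b):
    """True if clock a has seen every event that clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def vc_merge(a, b):
    """Element-wise maximum of two clocks."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def reconcile_carts(cart_a, clock_a, cart_b, clock_b):
    """Keep the newer version, or merge semantically when the two are concurrent."""
    if vc_descends(clock_a, clock_b):
        return cart_a, clock_a
    if vc_descends(clock_b, clock_a):
        return cart_b, clock_b
    return cart_a | cart_b, vc_merge(clock_a, clock_b)  # union: no add is lost


base_clock = vc_increment({}, "x")              # {"x": 1}, cart holds {"book"}
clock_on_x = vc_increment(base_clock, "x")      # replica x adds "pen"
clock_on_y = vc_increment(base_clock, "y")      # replica y adds "lamp" concurrently
merged_cart, merged_clock = reconcile_carts(
    {"book", "pen"}, clock_on_x, {"book", "lamp"}, clock_on_y
)
```

Neither clock descends from the other, so the versions are concurrent and the merge keeps both additions. A union merge cannot express deletions, which is one reason a real cart needs more careful semantic reconciliation.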
  
Eventual Consistency & AMZ - SimpleDB
•  SimpleDB strong consistency semantics [49,50]
   o Until Feb 2010, SimpleDB only supported eventual consistency, i.e. a GetAttributes after a PutAttributes might not return the same value for some time (1 second)
   o On Feb 24, AWS added the ConsistentRead=True attribute for reads
   o A read will reflect all writes that got 200 OK till that time!
  
Eventual Consistency & AMZ - SimpleDB
•  SimpleDB strong consistency semantics [49,50]
   o Also added conditional put/delete
   o Put only if an attribute has a specified value (Expected.1.Value=) or (Expected.1.Exists = true/false)
   o Same conditional check capability for delete also
   o Only on one attribute!
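The semantics of a conditional put can be modeled with a plain dictionary. This is only a local model: `conditional_put`, `expected_value` and `expected_exists` are hypothetical names that mirror the `Expected.1.Value` / `Expected.1.Exists` request parameters and are not the SimpleDB client API.

```python
class ConditionFailed(Exception):
    """Raised when the expected-state check does not hold (hypothetical name)."""


_UNSET = object()  # sentinel: the caller gave no expected value


def conditional_put(item, name, value, expected_value=_UNSET, expected_exists=None):
    """Write item[name] only if the single checked attribute is in the expected state."""
    if expected_exists is not None and (name in item) != expected_exists:
        raise ConditionFailed(name)
    if expected_value is not _UNSET and item.get(name, _UNSET) != expected_value:
        raise ConditionFailed(name)
    item[name] = value


item = {"stock": "5"}
conditional_put(item, "stock", "4", expected_value="5")      # first buyer wins
try:
    conditional_put(item, "stock", "4", expected_value="5")  # second buyer read a stale "5"
    second_write_applied = True
except ConditionFailed:
    second_write_applied = False
```

The second writer has to re-read the attribute and retry, which is the usual optimistic-concurrency loop built on this kind of check.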
  
Eventual Consistency & AMZ – S3
•  S3 is an eventual consistency system
    o Versioning
    o “S3 PUT & COPY synchronously store data across multiple facilities before returning SUCCESS”
    o Repair lost redundancy, repair bit-rot
    o Reduced Redundancy option for data that can be reproduced (99.999999999% vs. 99.99%)
        •  Approx 1/3rd less
    o CloudFront for caching
  
!SQL ?
•  “We conclude that the current RDBMS code lines, while attempting to be a “one size fits all” solution, in fact, excel at nothing. Hence, they are 25 year old legacy code lines that should be retired in favor of a collection of “from scratch” specialized engines.”[43]
•  “Current systems were built in an era where resources were incredibly expensive, and every computing system was watched over by a collection of wizards in white lab coats, responsible for the care, feeding, tuning and optimization of the system. In that era, computers were expensive and people were cheap”
•  “The 1970 - 1985 period was a time of intense debate, a myriad of ideas, & considerable upheaval. We predict the next fifteen years will have the same feel”
  
Further deliberation
•  Daniel Abadi[45], Mike Stonebraker[46], James Hamilton[47], Pat Helland[48] are all good reads for further deliberation
  
NOSQL Internals & Algorithmics
Caveats
•  A representative subset of the mechanics and mechanisms used in the NOSQL world
•  They are being refined & newer ones are being tried
•  At a system level – to show how the techniques play a part to deliver a capability
•  The NOSQL papers and other references for further deliberation
•  Even if we don’t cover it fully, it is OK. I want to introduce some of the concepts so that you get an appreciation …
  
NOSQL Mechanics
•  Horizontal Scalability
    –  Gossip (Cluster membership)
    –  Failure Detection
    –  Consistent Hashing
    –  Replication Techniques
         •  Hinted Handoff
         •  Merkle Trees
    –  Sharding MongoDB
    –  Regions in HBase
•  Performance
    –  SSTables/memtables
    –  LSM w/Bloom Filter
•  Integrity/Version reconciliation
    –  Timestamps
    –  Vector Clocks
    –  MVCC
    –  Semantic vs. syntactic reconciliation
  	
  
Consistent Hashing
•  Origin: web caching, “to decrease ‘hot spots’”
•  Three goals[87]
    –  Smooth evolution
        •  When a new machine joins, minimum rebalance work and impact
    –  Spread
        •  Objects assigned to a min number of nodes
    –  Load
        •  # of distinct objects assigned to a node is small
  
Consistent Hashing
•  Hash Keyspace/Token is divided into partitions/ranges
•  Cassandra – choice
      –  OrderPreserving partitioner – key = token (for range queries)
      –  Also saw a CollatingOrderPreservingPartitioner
•  Partitions assigned to nodes that are logically arranged in a circle topology
•  Amz (Dynamo) – assign sets of (random) multiple points to different machines depending on load
•  Cassandra – monitor load & distribute
•  Specific join & leave protocols
•  Replication – next 3 consecutive
•  Cassandra – Rack-aware, Datacenter-aware
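A minimal sketch of the ring described above, assuming MD5 tokens, a fixed number of virtual points per node, and replication to the next distinct nodes clockwise. The class and method names are illustrative and do not come from Dynamo or Cassandra.

```python
import bisect
import hashlib


def token(key):
    """Map a string onto the ring (MD5 is an assumption, chosen for illustration)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, points_per_node=8):
        # Each node owns several points on the ring to even out the load.
        self._points = sorted(
            (token(f"{node}#{i}"), node) for node in nodes for i in range(points_per_node)
        )
        self._tokens = [t for t, _ in self._points]

    def preference_list(self, key, replicas=3):
        """Walk clockwise from the key's token and collect distinct nodes."""
        start = bisect.bisect_right(self._tokens, token(key))
        owners = []
        for step in range(len(self._points)):
            node = self._points[(start + step) % len(self._points)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == replicas:
                break
        return owners


before = Ring(["node-a", "node-b", "node-c", "node-d"])
after = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
keys = [f"user:{i}" for i in range(1000)]
moved = [k for k in keys if before.preference_list(k, 1) != after.preference_list(k, 1)]
```

Only the keys in `moved` change their primary node when `node-e` joins, and each of them moves to `node-e`, which is the "smooth evolution" goal in miniature.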
  
Consistent Hashing - Hinted-handoff
•  What happens when a node is not available?
    –  May be under load
    –  May be a network partition
•  Sloppy Quorum & Hinted-handoff
•  R/W performed on the 1st n healthy nodes
•  Replica sent to a host node with a hint in metadata & then transferred when the actual node is up
•  Burdens neighboring nodes
•  Cassandra 0.6.2 default is disabled (I think)
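The sloppy-quorum write and the later hand-back can be sketched as follows. This is a toy model under stated assumptions: nodes are dictionary entries, health is a given set, and the function names are hypothetical.

```python
def sloppy_write(preference_list, healthy, n, key, value, stores, hints):
    """Write to the first n healthy nodes; a stand-in keeps a hint for each node that is down."""
    intended = preference_list[:n]
    stand_ins = [node for node in preference_list[n:] if node in healthy]
    for target in intended:
        if target in healthy:
            stores[target][key] = value
        elif stand_ins:
            holder = stand_ins.pop(0)
            hints[holder].append((target, key, value))  # replica plus hint metadata


def deliver_hints(recovered, stores, hints):
    """Hand hinted replicas back once the intended node is reachable again."""
    for holder in list(hints):
        remaining = []
        for target, key, value in hints[holder]:
            if target == recovered:
                stores[recovered][key] = value
            else:
                remaining.append((target, key, value))
        hints[holder] = remaining


nodes = ["A", "B", "C", "D"]
stores = {name: {} for name in nodes}
hints = {name: [] for name in nodes}

# B is down, so D stands in for it.
sloppy_write(nodes, healthy={"A", "C", "D"}, n=3, key="cart:7",
             value="book", stores=stores, hints=hints)
held_for_b = list(hints["D"])
deliver_hints("B", stores, hints)
```

The extra replica held by `D` is the burden on neighboring nodes that the slide mentions.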
  
Consistent Hashing - Replication
•  What happens when a new node joins?
   – It gets one or more partitions
   – Dynamo: Copy the whole partition
   – Cassandra: Replicate keyset
   – Cassandra: working on a BitTorrent-type protocol to copy from replicas
  
Anti-entropy
•  Merge and reconciliation operations
    –  Operate on two states and return a new state[86]
•  Merkle Trees
    –  Dynamo uses Merkle trees to detect inconsistencies between replicas
    –  AntiEntropy in Cassandra exchanges Merkle trees and, if they disagree, does range repair via compaction[91,92]
    –  Cassandra uses the Scuttlebutt Reconciliation[86]
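To show how a Merkle tree narrows down which ranges differ between two replicas, here is a small sketch. It assumes a power-of-two number of key ranges, SHA-256 hashes, and invented function names; real systems differ in all three.

```python
import hashlib


def _hash(data):
    return hashlib.sha256(data).digest()


def build_tree(ranges):
    """Return the tree as a list of levels: leaf hashes first, root level last."""
    levels = [[_hash(chunk) for chunk in ranges]]
    while len(levels[-1]) > 1:
        below = levels[-1]
        levels.append([_hash(below[i] + below[i + 1]) for i in range(0, len(below), 2)])
    return levels


def differing_ranges(tree_a, tree_b):
    """Walk down from the root, descending only into subtrees whose hashes differ."""
    pending = [(len(tree_a) - 1, 0)]
    different = []
    while pending:
        level, index = pending.pop()
        if tree_a[level][index] == tree_b[level][index]:
            continue  # identical subtree: nothing below needs to be compared
        if level == 0:
            different.append(index)
        else:
            pending.extend([(level - 1, 2 * index), (level - 1, 2 * index + 1)])
    return sorted(different)


replica_a = [b"range-0", b"range-1", b"range-2", b"range-3"]
replica_b = [b"range-0", b"range-1", b"range-2-stale", b"range-3"]
to_repair = differing_ranges(build_tree(replica_a), build_tree(replica_b))
```

Only the ranges listed in `to_repair` need to be streamed between replicas; matching subtrees are skipped after a single hash comparison.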
  
Gossip
•  Membership & Failure detection
•  Based on emergence without rigidity – pulse coupled oscillators, biological systems like fireflies![90]
•  Also used for state propagation
   –  Used in Dynamo/Cassandra
  
Gossip
•  Cassandra exchanges heartbeat state, application state and so forth
•  Every second, a node picks a random live node and a random unreachable node and exchanges key-value structures
•  Some nodes play the part of seeds
•  Seeds/initial contact points come from a static conf file (storage.conf)
•  Could also come from a configuration service like ZooKeeper
•  To guard against node flap, explicit membership join and leave – now you know why hinted handoff was added
  	
  
Membership & Failure detection
•  Consensus & Atomic Broadcast - impossible to solve in an asynchronous distributed system[88,89]
     –  Cannot differentiate between a slow system and a crashed system
•  Completeness
     –  Every system that crashed will be eventually detected
•  Correctness
     –  A correct process is never suspected
•  In short, if you are dead somebody will notice it and if you are alive, nobody will mistake you for dead!
  
Φ Accrual Failure Detector
•  Not a Boolean value but a probabilistic number that “accrues” over an exponential scale
•  Captures the degree of confidence that a corresponding monitored process has crashed[94]
     –  Suspicion Level
     –  Φ = 1 -> prob(error) 10%
     –  Φ = 2 -> prob(error) 1%
     –  Φ = 3 -> prob(error) 0.1%
•  If process is dead,
     –  Φ is monotonically increasing & Φ → ∞ as t → ∞
•  If process is alive and kicking, Φ = 0
•  Account for lost messages, network latency and actual crash of system/process
•  Well known heartbeat period Δi, then network latency Δtr can be tracked by inter-arrival time modeling
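The suspicion level can be computed as Φ = -log10(P_later), where P_later is the probability that a heartbeat arrives later than the time already elapsed since the last one. The sketch below assumes the inter-arrival times are modeled with a normal distribution; that choice, the variance floor and the function name are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def phi(time_since_last, intervals):
    """Suspicion level from observed heartbeat inter-arrival times (seconds)."""
    mu = mean(intervals)
    sigma = max(pstdev(intervals), 1e-3)  # floor avoids division by zero
    # P(next heartbeat arrives later than time_since_last) under a normal model
    p_later = 0.5 * math.erfc((time_since_last - mu) / (sigma * math.sqrt(2)))
    p_later = max(p_later, 1e-300)        # keep log10 defined for very long silences
    return -math.log10(p_later)


observed = [1.0, 1.1, 0.9, 1.0, 1.0]      # heartbeats roughly once per second
on_time = phi(1.0, observed)
late = phi(1.5, observed)
very_late = phi(3.0, observed)
```

A caller compares Φ against a threshold: declaring the process dead at Φ = 2, for example, accepts roughly a 1% chance (10^-2) that the suspicion is a mistake.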
  
Write/Read Mechanisms
•  Read & Write to a random node (StorageProxy)
•  Proxy coordinates the read and write strategy (R/W = any, quorum et al)
•  Memtables/SSTables from Bigtable
•  Bloom Filter/Index
•  LSM Trees
  
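[Figure: write/read path on a node. A Write goes to the Commit Logs and the MemTable (memory); the MemTable is flushed to disk as SSTables, each with an Index and a Bloom Filter (BF); a Read consults memory and disk. HBase equivalents: WAL, Memstore, HDFS file system]
SSTable
• Immutable
• Compaction
• Maintain Index & Bloom Filter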
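The write path in the figure can be approximated with a toy store. This is a sketch only: `TinyLSM` and its methods are invented names, everything lives in memory, and the per-SSTable index and Bloom filter are left out.

```python
class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # mutable, in memory
        self.sstables = []     # immutable sorted tuples of (key, value), newest last
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # log first, for recovery
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the memtable into a sorted, immutable SSTable."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        """Check memory first, then SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for stored_key, value in table:
                if stored_key == key:
                    return value
        return None

    def compact(self):
        """Merge all SSTables into one, keeping the newest value per key."""
        merged = {}
        for table in self.sstables:  # oldest first, so newer values overwrite
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))]


db = TinyLSM(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    db.write(key, value)
tables_before_compaction = len(db.sstables)
value_of_a = db.read("a")
db.compact()
```

Writes never modify an SSTable in place; the stale value of `a` simply disappears when compaction merges the two tables.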
How… does HBase work again?
           http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html
           http://hbaseblog.com/2010/07/04/hug11-hbase-0-90-preview-wrap-up/
Bloom Filter
•  The BloomFilter answers the question
•  “Might there be data for this key in this SSTable?” [Ref: Cassandra/HBase mailer]
     –  “Maybe” or
     –  “Definitely not”
     –  When the BloomFilter says “maybe” we have to go to disk to check out the content of the SSTable
•  Depends on implementation
     –  Redone in Cassandra
     –  HBase 0.20.x removed, will be back in 0.90 with a “jazzy” implementation
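A minimal Bloom filter that gives exactly the "maybe" / "definitely not" answers described above. The sizes, the use of SHA-256 to derive bit positions, and the class name are illustrative assumptions; Cassandra and HBase use their own implementations.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        """Derive num_hashes bit positions for the key."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key):
        """False means definitely not present; True means maybe (go check the SSTable)."""
        return all(self.bits[position] for position in self._positions(key))


sstable_keys = ["row:1", "row:2", "row:3"]
bloom = BloomFilter()
for key in sstable_keys:
    bloom.add(key)
```

A key that was added always gets "maybe", so there are no false negatives; a key that was never added usually gets "definitely not", and the occasional false "maybe" only costs one unnecessary disk read.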
  
Was it a vision, or a waking dream?
Fled is that music:—do I wake or sleep?
                 -Keats, Ode to a Nightingale
•    http://www.readwriteweb.com/enterprise/2011/11/infographic-data-deluge---8-ze.php
•    http://www.crn.com/news/data-center/232200061/efficiency-or-bust-data-centers-drive-for-low-power-solutions-prompts-channel-growth.htm
•    http://www.quantumforest.com/2011/11/do-we-need-to-deal-with-big-data-in-r/
•    http://www.forbes.com/special-report/2011/migration.html
•    http://www.mercurynews.com/bay-area-news/ci_19368103
•    http://www.businessinsider.com/apple-new-data-center-north-carolina-created-50-jobs-2011-11

The Art of Big Data

  • 1.
    The road liesplain before me;--'tis a theme Single and of determined bounds; … - Wordsworth, The Prelude m pre ss.co . word ol bl eclix te Scho p:/ /dou Gr adua 1 ka r, htt val Post 2 9,201 n a San r, Na Nov Krish in a st Sem hD Gue 00–P EC40
  • 2.
    What is Big Data ? Big Data to smart data Big o  Agenda Data o  To cover the broad Pipeline picture o  Understand the waypoints & o  Drill down into one area (NOSQL) Analytics/ Modeling Analytic Storage - R Algorithms NOSQL o  Can do others later … Processing - o  Of the Big Data Visualization Hadoop … domain …
  • 3.
    Thanks to … Thegiants whose shoulders I am standing on Special  Thanks  to:        Peter  Ateshian,  NPS        Prof  Murali  Tummala,  NPS        Shirley  Bailes,O’Reilly        Ed  Dumbill,O’Reilly        Jeff  Barr,AWS        Jenny  Kohr  Chynoweth,AWS  
  • 4.
    When I thinkof my own native land, In a moment I seem to be there; But, alas! recollection at hand Soon hurries me back to despair. - Cowper, The Solitude Of Alexander SelKirk
  • 5.
    What is BigData ? “Big data” is data “Big data” is less that becomes large about size, more enough that it about flow & velocity cannot be processed - persisting using conventional petabytes per year is methods. @twitter easier than processing terabytes per hour. @twitter Ref:  hIp://radar.oreilly.com/2010/09/the-­‐smaq-­‐stack-­‐for-­‐big-­‐data.html  
  • 6.
    What is BigData ? Vinod Khosla’s Cool Dozen!   Consumers : “Widespread innovation in technologies that reduce data overload for users” ~ Data Reduction   Businesses : “Simple solutions to handle the deluge of data generated from various sources …” ~ Big Data Analytics TV  2.0,  EducaXon,  Social  NEXT,Tools  for  sharing  inteerst,Publishing,…   Ref:  hIp://www.ciol.com/News/News/News-­‐Reports/Vinod-­‐Khosla%E2%80%99s-­‐cool-­‐dozen-­‐tech-­‐innovaXons/156307/0/   hIp://yourstory.in/2011/11/vinod-­‐khoslas-­‐keynote-­‐at-­‐nasscom-­‐product-­‐conclave-­‐reject-­‐punditry-­‐believe-­‐in-­‐an-­‐idea-­‐take-­‐risk-­‐and-­‐succeed/  
  • 7.
    EBC322     Volume o  Scale     Velocity o  Data  change  rate  vs.  decision  window     Variety o  Different  sources  &  formats   o  Structured  vs.  Unstructured     Variability o  Breadth  of  interpreta<on  &   o  Depth  of  analy<cs     Contextual o  Dynamic  variability   o  RecommendaXon     Connectedness hIp://doubleclix.wordpress.com/2011/09/13/when-­‐is-­‐big-­‐data-­‐really-­‐big-­‐data/   hIp://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf  
  • 8.
    EBC322     Volume o  Scale     Velocity o  Data  change  rate  vs.  decision  window     Variety o  Different  sources  &  formats   o  Structured  vs.  Unstructured     Variability o  Breadth  of  interpreta<on  &   o  Depth  of  analy<cs     Contextual o  Dynamic  variability   o  RecommendaXon     Connectedness hIp://doubleclix.wordpress.com/2011/09/13/when-­‐is-­‐big-­‐data-­‐really-­‐big-­‐data/   hIp://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf  
  • 9.
    EBC322     Volume o  Scale     Velocity o  Data  change  rate  vs.  decision  window     Variety o  Different  sources  &  formats   o  Structured  vs.  Unstructured     Variability o  Breadth  of  interpreta<on  &   o  Depth  of  analy<cs     Contextual o  Dynamic  variability   o  RecommendaXon     Connectedness hIp://doubleclix.wordpress.com/2011/09/13/when-­‐is-­‐big-­‐data-­‐really-­‐big-­‐data/   hIp://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf  
  • 10.
    EBC322     Volume o  Scale     Velocity o  Data  change  rate  vs.  decision  window     Variety o  Different  sources  &  formats   o  Structured  vs.  Unstructured     Variability o  Breadth  of  interpreta<on  &   o  Depth  of  analy<cs     Contextual o  Dynamic  variability   o  RecommendaXon     Connectedness hIp://doubleclix.wordpress.com/2011/09/13/when-­‐is-­‐big-­‐data-­‐really-­‐big-­‐data/   hIp://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf  
  • 11.
    EBC322     Volume o  Scale     Velocity o  Data  change  rate  vs.  decision  window     Variety o  Different  sources  &  formats   o  Structured  vs.  Unstructured     Variability o  Breadth  of  interpreta<on  &   o  Depth  of  analy<cs     Contextual o  Dynamic  variability   o  RecommendaXon     Connectedness hIp://doubleclix.wordpress.com/2011/09/13/when-­‐is-­‐big-­‐data-­‐really-­‐big-­‐data/   hIp://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf  
  • 12.
    I.  Two  Main  Types  –  based  on  collecXon   i.  Big  Data  Streams   o  Data  in  “moXon”   o  TwiIer  fire  hose,  Facebook,  G+     ii.  Big  Data  Logs   o  Data  “at  rest”   o  Logs,  DW,  external  market  data,  POS,  …   II.  Typically,  Big  Data  has  a  non-­‐determinisXc  angle  as  well  …   o  CreaXve  Discovery   o  IteraXve,  Model  based  AnalyXcs   o  Explore  quesXons  to  ask   III.  Smart  Data  =  Big  Data  +  context  +  embedded/interacXve  (inference,   reasoning)  models   o  Model  Driven   o  DeclaraXvely  InteracXve   hIp://www.slideshare.net/leonsp/hadoop-­‐slides-­‐11-­‐what-­‐is-­‐big-­‐data   hIp://www.slideshare.net/Dataversity/wed-­‐1550-­‐bacvanskivladimircolor  
  • 13.
    AWS – 600Billion objects! Twitter §  200 million tweets/day §  Peak 10,000/second §  How would you handle the fire hose for social network analytics ? Zynga §  “Analytics company, not a gaming company!” §  Harvests data : 15 TB/day Storage §  Test new features §  4 U box = 40 TB, §  Target advertising 1 PB = 25 boxes ! §  §  230 million players/month hIp://goo.gl/dcBsQ  
  • 14.
    •  6  Billion  Messages  per   day   •  2  PB  (w/compression)   online   •  6  PB  w/  replicaXon   •  250  TB/Month  growth   •  HBase  Infrastructure  
  • 15.
    50  TB/Day   Very  systemaXc   240  nodes,  84  PB   Diagram  speaks  volumes!   Path  Analysis   Teradata  InstallaXon   A/B  TesXng   Ref:  hIp://www.hpts.ws/sessions/2011HPTS-­‐TomFastner.pdf  
  • 16.
    •  “…  they  didn’t  need  a  genius,  …  but  build  the  world’s  most  impressive   dileIante  …  baIling  the  efficient  human  mind  with  spectacular   flamboyant  inefficiency”  –  Final  Jeopardy  by  Stephen  Baker   •  15  TB  memory,  across  90  IBM  760  servers,  in  10  racks   •  1  TB  of  dataset   •  200  Million  pages  processed  by  Hadoop   •  This  is  a  good  example  of  Connected  data   –  Contextual  w/  variability   –  Breath  of  interpretaXon   –  AnalyXcs  depth   hIp://doubleclix.wordpress.com/2011/03/01/the-­‐educaXon-­‐of-­‐a-­‐machine-­‐%E2%80%93-­‐review-­‐of-­‐book-­‐%E2%80%9Cfinal-­‐jeopardy %E2%80%9D-­‐by-­‐stephen-­‐baker/   hIp://doubleclix.wordpress.com/2011/02/17/watson-­‐at-­‐jeopardy-­‐a-­‐race-­‐of-­‐machines/  
  • 17.
    Warehouse-­‐style   ApplicaXons   Block  Store   Distributed   Big Data ApplicaXons   Storage   Object  Store   NOSQL   AnalyXcs   Parallelism   Map/Reduce   Web   HPC   AnalyXcs   Cloud   Architecture   Social  Media   Log   Inference   AnalyXcs   Social     RecommendaXon/ Graph   Inference  Engines   Machine   Knowledge   Search,   Learning   Mahout   Graph   Indexing   ClassificaXon,  Clustering  
  • 18.
    “A towel isabout the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with.” - From The Hitchhiker's Guide to the Galaxy, by Douglas Adams. Published by Harmony Books in 1979 Big  Data  to  Smart  Data
  • 19.
    Don’t  throw  away  1 any  data  ! Big  data  to  smart  data Be  ready  for  different   2 ways  of  organizing   the  data •  summary h;p://goo.gl/fGw7r
  • 20.
    Big  Data  Pipeline If a problem has no solution, it is not a problem, but a fact, not to be solved but to be coped with, over time … - Peres’s Law
  • 21.
    Big  Data  Pipeline • Stages o  Collect o  Store o  Transform & Analyze o  Model & Reason o  Predict, Recommend & Visualize •  Different systems have different characteristics o  Infrastructure optimization based in application/hardware attributes correlation (short term) •  Hadoop, Splunk, internal Dashboard o  Application performance trends (medium term) •  Analytics, Modeling,… o  Product Metrics •  Feature set vs. usage, what is important to users, stratification •  Modeling using R, Visualization layers like Tableau
  • 22.
    Big Data Pipeline Ref:h;p:goo.gl/Mm83k Infer-ability Model Internal   dashboards Hand   ,  Tableau   Context coded     Programs,   Connectedness R,  Mahout,   …   SQL,       Variety BI  Tools,   Hadoop,   Pig,   Variability SQL   Hive,     .NET   NOSQL,   Logs,   Dryad,   Velocity Scribe,   HDFS,   XML,   Various   Flume,   other   <iles,  …   Volume Hadoop   tools   …   Decomplexify! Contextualize! Network! Reason! Infer!
  • 23.
    Build to Fail- “It is working” is not binary The  NOSQL  ! I AM monarch of all I survey; My right there is none to dispute; From the centre all round to the sea I am lord of the fowl and the brute - Cowper, The Solitude Of Alexander SelKirk
  • 24.
    Agenda •  Opening Gambit –  NOSQL  :  Toil,  Tears  &  Sweat  !   •  The Pragmas –  ABCs  of  NOSQL  [ACID,  BASE  &  CAP]   •  The Mechanics –  Algorithmics  &  Mechanisms  (For  reference)   Referenced Links @ http://doubleclix.wordpress.com/2010/06/20/nosql-talk-references/
  • 25.
    What is NOSQL Anyway ? •  NOSQL    !=  NoSQL  or  NOSQL  !=  (!SQL)   •  NOSQL  =  Not  Only  SQL   •  Can  be  traced  back  to  Eric  Evans[2]!   –  You  can  ask  him  during  the  ayernoon  session!   •  Unfortunate  Name,  but  is  stuck  now   •  Non  RelaXonal  could  have  been  beIer   •  Usually  OperaXonal,  Definitely  Distributed   •  NOSQL  has  certain  semanXcs  –  need  not  stay  that  way  
  • 26.
    NOSQL   Key  Value   Column   Document   Graph   In-­‐memory   SimpleDB   CouchDB   Neo4j   Memcached   Google   MongoDB   FlockDB   BigTable   Disk  Based   HBase   Lotus  Domino   InfiniteGraph   Redis   Cassandra   Riak   Tokyo  Cabinet   Dynamo   HyperTable   Voldemort   Azure  TS   Ref:  [22,51,52]  
  • 27.
    When I thinkof my own native land, In a moment I seem to be there; But, alas! recollection at hand Soon hurries me back to despair. - Cowper, The Solitude Of Alexander SelKirk NOSQL Tales from the field WHAT WORKS
  • 28.
    •  Designer AugmentingRDBMS with a Distributed key Value Store[40 : A good talk by Geir] •  Invitation only designer brand sales •  Limited inventory sales – start at 12:00, members have 10 min to grab them. 500K mails every day •  Keeps brand value, hidden from search •  Interesting load properties •  Each item a row in DB-BUY NOW reserves it –  Can't order more •  Started out as a Rails app –  shared nothing •  Narrow peaks – half of revenue
  • 29.
    Christian Louboutin Effect •  ½ amz for Louboutin •  Use Voldemort •  Inventory, Shopping Cart, Checkout •  Partition by prod ID •  Shared infrastructure – “fog” not “cloud’ - Joyent! •  In-memory inventory •  Not afraid of sale anymore! And SQL DBs are still relevant !
  • 30.
    Typical NOSQL ExampleBit.ly •  Bit,ly URL shortening service, uses MongoDB •  User, title, URL, hash, labels[I-5], sort by time •  Scale – ~50M users, ~10K concurrent, ~1.25B shortens per month •  Criteria: –  Simple, Zippy FAST, Very Flexible, Reasonable Durability, Low cost of ownership •  Sharded by userid
  • 31.
    •  New kindof “dictionary” a word repository, GPS for English – context, pronunciations, twitter … developer API •  Characteristics[I-6,Tony Tam’s presentation] –  RO-centric, 10,000 reads for every write –  Hit a wall with MySQL (4B rows) –  MongoDB read was so good that memcached layer was not required –  MongoDB used 4 times MySQL storage •  Another example : –  Voldemort – Unified Communications, IP-Phone data stored keyed off of phone number. Data relatively stable
  • 32.
    Large Hadron Collider@CERN • DAS is part of giant data management enterprise (cms) –  Polygot Persistence (SQL + NOSQL, Mongo, Couch, memcache, HDFS, Luster, Oracle, mySQL, …) •  Data Aggregation System [I-1,I-2,I-3,I-4] –  Uses MongoDB –  Distributed Model, 2-6 pb data –  Combine info. from different metadata sources, query without knowing their existence, user has domain knowledge – but shouldn’t deal with various formats, interfaces and query semantics –  DAS aggregates, caches and presents data as JSON documents – preserving security & integrity And SQL DBs are still relevant !
  • 33.
  • 34.
    •  Digg –  RDBMS places burden on reads than writes[I-8] –  Looked at NOSQL, selected Cassandra •  Colum oriented, so more structure than key-value •  Heard from noSQL Boston[http://twitter.com/ #search?q=%23nosqllive] –  Baidu: 120 node HyperTable cluster managing 600TB of data –  StumbleUpon uses HBase for Analytics –  Twitter’s Current Cassandra cluster: 45 nodes
  • 35.
    •  Adob isa HBase shop •  BBC is a CouchDB shop [I-10,I-11,2] [I-13] •  Adobe SaaS Infrastructure – •  Sweet spot: tagging, content aggregation, •  Multi-master, multi search, storage and so forth datacenter replication •  Dynamic schema & huge number of records[I-5] •  40 million records in 2008 to 1 billion with 50 ms response •  Interactive Mediums •  NOSQL not mature in 2008, •  Old data to CouchDB now good enough •  Thus free up DB to do •  Prod Analytics:40 nodes, work! largest has 100 nodes
  • 36.
    •  Cloudkick isa Cassandra shop[I-12] •  Cloudkick offers cloud management services •  Store metrics data •  Linear scalability for write load •  Massive write performance •  Memory table & serial commit log •  Low operational costs •  Data Structure –  Metrics, Rolled-up data, Statuses at time slice : all indexed by timestamp
  • 37.
    •  Guardian/UK –  Runs on Redis[I-14] ! –  “Long-term The Guardian is looking towards the adoption of a schema-free database to sit alongside its Oracle database and is investigating CouchDB. … the relational database is now just a component in the overall data management story, alongside data caching, data stores, search engines And SQL DBs are etc. still relevant ! –  NOSQL can increase performance of "The evil that SQL relational data by offloading specific DBs do lives after data and tasks them; the good is oft interred with their bones...",
  • 38.
    NOSQL at Netflix • Netflix is fully in the cloud •  Uses NOSQL across the globe •  Customer Profiles, watchlog, usage logging (see next slide) –  No multi-record locking •  No DBA ! •  Easier Schema Changes •  Less complex, Highly Available data store •  Joins happen in the applications http://www.hpts.ws/sessions/nosql-ecosystem.pdf http://www.hpts.ws/sessions/GlobalNetflixHPTS.pdf
  • 40.
    21 NOSQL Themes • Web  Scale   •  Scale  Incrementally/conXnuous  growth   •  Oddly  shaped  &  exponenXally  connected   •  Structure  data  as  it  will  be  used  –  i.e.  read,  query   •  Know  your  queries/updates  in  advance[96],  but  you  can  change   them  later   •  Compute  aIributes  at  run  Xme   •  Create  a  few  large  enXXes  with  opXonal  parts   –  NormalizaXon  creates  many  small  enXXes   •  Define  Schemas  in  models  (not  in  databases)   •  Avoid  impedance  mismatch   •  Narrow  down  &  solve  your  core  problem   •  Solve  the  right  problem  with  the  right  tool   Ref:  [I-­‐8]  
  • 41.
    21 NOSQL Themes • ExisXng  soluXons  are  clunky[1]  (in  certain  situaXons)   •  Scale  automaXcally,  “becoming  prohibiXvely  costly  (in   terms  of  manpower)  to  operate”  TwiIer[I-­‐9]     •  DistribuXon  &  parXXoning  are  built-­‐in  NOSQL   •  RDBMS  distribuXon  &  sharding  not  fun  and  is  expensive   –  Lose  most  funcXonality  along  the  way   •  Data  at  the  center,  Flexible  schema,  Less  joins   •  The  value  of  NOSQL  is  in  flexibility  as  much  as  it  is  in  “Big   Data”  
  • 42.
    21 NOSQL Themes • Requirements[3] – Data will not fit in one node • And so need data partition/distribution by the system – Nodes will fail, but data needs to be safe – replication! – Low latency for real-time use • Data Locality – Row based structures will need to read whole row, even for a column – Column based structures need to scan for each row • Solution: Column storage with Locality – Keep data that is read together, don’t read what you don’t care about • For example friends – other data Ref: [3]
  • 43.
    ABCs of NOSQL- ACID, BASE & CAP The woods are lovely, dark, and deep, But I have promises to keep, And miles to go before I sleep, And miles to go before I sleep. -Frost
  • 44.
    CAP Principle “CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37] [Diagram labels: Consistency, Availability, Partition] Which feature to discard depends on the nature of your system[41]
  • 45.
    CAP Principle “CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37] C-A No P → Single DB server, no network partition [Diagram labels: Consistency, Availability, Partition] Which feature to discard depends on the nature of your system[41]
  • 46.
    CAP Principle “CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37] C-P No A → Block transaction in case of partition failure [Diagram labels: Consistency, Availability, Partition] Which feature to discard depends on the nature of your system[41]
  • 47.
    CAP Principle “CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37] A-P No C → Expiration based caching, voting majority [Diagram labels: Consistency, Availability, Partition] [Callout: Interesting (& controversial) from NOSQL perspective]
  • 48.
    ABCs of NOSQL • ACID o Atomicity, Consistency, Isolation & Durability – fundamental properties of SQL DBMS • BASE[35,39] o Basically Available, Soft state (Scalable), Eventually Consistent • CAP[36,39] o Consistency, Availability & Partitioning o This C is ~A+C • i.e. Atomic Consistency[36]
  • 49.
    ACID • Atomicity o All or nothing • Consistent o From one consistent state to another • e.g. Referential Integrity o But it is also application dependent • e.g. min account balance • Predicates, invariants,… • Isolation • Durability
  • 50.
    CAP Pragmas • Preconditions o The domain is scalable web apps o Low latency for real-time use o A small sub-set of SQL functionality o Horizontal scaling • Pritchett[35] talks about relaxing consistency across functional groups rather than within functional groups • Idempotency to consider o Updates inc/dec are rarely idempotent o Order preserving trx are not idempotent either o MVCC is an answer for this (CouchDB)
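The idempotency point on this slide can be sketched in a few lines. This is a minimal illustration, not CouchDB's API: `increment`, `put_if_revision`, and the `(revision, value)` tuple layout are hypothetical names chosen for the example. A replayed increment changes the result a second time, while a write that carries the revision it read is rejected on replay.

```python
def increment(counters, key, delta):
    """Blind increment: applying the same message twice changes the state twice."""
    counters[key] = counters.get(key, 0) + delta


def put_if_revision(docs, key, expected_rev, value):
    """MVCC-style write (hypothetical helper): apply only if the stored
    revision is the one the writer read."""
    current_rev, _ = docs.get(key, (0, None))
    if current_rev != expected_rev:
        return False  # stale or replayed write: rejected, state unchanged
    docs[key] = (current_rev + 1, value)
    return True


if __name__ == "__main__":
    counters = {}
    increment(counters, "views", 1)
    increment(counters, "views", 1)  # replay of the same message
    print(counters)

    docs = {}
    print(put_if_revision(docs, "profile", 0, "v1"))
    print(put_if_revision(docs, "profile", 0, "v1"))  # replay is refused
```

The revision check turns a retry into a harmless no-op, which is why the slide points at MVCC as an answer.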
  • 51.
    Consistency • Strict Consistency o Any read on Data X will return the most recent write on X[42] • Sequential Consistency o Maintains sequential order from multiple processes (No mention of time) • Linearizability o Add timestamp from loosely synchronized processes
  • 52.
    Consistency • Write availability, not read availability[44] • Even load distribution is easier in eventually consistent systems • Multi-data center support is easier in eventually consistent systems • Some problems are not solvable with eventually consistent systems • Code is sometimes simpler to write in strongly consistent systems
  • 53.
    CAP Essentials – 1 of 3 • “CAP Principle → Strong Consistency, High Availability, Partition-resilience: Pick at most 2”[37] o C-A No P → Single DB server, no network partition o C-P No A → Block transaction in case of partition failure o A-P No C → Expiration based caching, voting majority • Which feature to discard depends on the nature of your system[41]
  • 54.
    CAP Essentials – 2 of 3 • Yield vs. Harvest[37] o Yield → Probability of completing a request o Harvest → Fraction of data reflected in the response • Some systems tolerate < 100% harvest (e.g. search, i.e. approximate answers OK); others need 100% harvest (e.g. Trx, i.e. correct behavior = single well defined response) • For sub-systems that tolerate harvest degradation, CAP makes sense
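To make harvest concrete, here is a small sketch of a search that answers from whichever shards are reachable and reports what fraction of the data the answer reflects. The `search` function and the list-of-dicts shard layout are assumptions made for illustration only.

```python
def search(shards, term):
    """Answer from reachable shards and report harvest.

    shards: list where each entry is a dict {term: [doc ids]}, or None
    when that shard is unreachable (hypothetical layout for this sketch).
    """
    hits = []
    answered = 0
    for shard in shards:
        if shard is None:
            continue  # partitioned away: skip it rather than fail the request
        answered += 1
        hits.extend(shard.get(term, []))
    harvest = answered / len(shards)
    return hits, harvest


if __name__ == "__main__":
    shards = [{"nosql": ["d1"]}, None, {"nosql": ["d7"]}, {}]
    print(search(shards, "nosql"))
```

The request still completes (yield is preserved) while harvest drops below 100%, which is acceptable for search but not for a transaction.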
  • 55.
    CAP Essentials – 3 of 3 • Trading Harvest for yield – AP • Application decomposition & use NOSQL in appropriate sub-systems that have state management and data semantics that match the operational feature & impedance o Hence NotOnly SQL not No SQL o Intelligent homing to tolerate partition failures[44] o Multi zones in a region (150 miles - 5 ms) o Twitter tweets in Cassandra & MySQL o BBC using MongoDB for offloading DBMS o Polyglot persistence at LHC@CERN
  • 56.
    CAP Essentials – 3 of 3 • Trading Harvest for yield – AP • Application decomposition & use NOSQL in appropriate sub-systems that have state management and data semantics that match the operational feature & impedance o Hence NotOnly SQL not No SQL o Intelligent homing to tolerate partition failures[44] o Multi zones in a region (150 miles - 5 ms) o Twitter tweets in Cassandra and MySQL o BBC using MongoDB for offloading DBMS o Polyglot persistence at LHC@CERN [Callout: Most important point in the whole presentation]
  • 57.
    Eventual Consistency & AMZ • Distribution Transparency[38] • In larger distributed systems, network partitions are a given • Consistency Models o Strong o Weak • Has an inconsistency window between the update and the guaranteed view o Eventual • If no new updates, all will see the value, eventually
  • 58.
    Eventual Consistency & AMZ • Guarantee variations[38] o Read-Your-writes o Session consistency o Monotonic Read consistency • Access will not return previous value o Monotonic Write consistency • Serialize write by the same process • Guarantee order (vector clocks, mvcc) o Example: Amz Cart merger (let cart add even with partial failure)
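The vector-clock ordering and the cart merge mentioned on this slide can be sketched as follows. This is an illustrative model, not Amazon's implementation: clocks are plain dicts of per-node counters, a cart is a Python set, and `compare` and `reconcile` are names invented for the example.

```python
def compare(a, b):
    """Order two vector clocks: 'before', 'after', 'equal' or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


def reconcile(version_a, version_b):
    """Each version is (clock, cart). Keep the newer one; if the two are
    concurrent, merge the carts by union so that no added item is lost."""
    (clock_a, cart_a), (clock_b, cart_b) = version_a, version_b
    order = compare(clock_a, clock_b)
    if order in ("before", "equal"):
        return version_b
    if order == "after":
        return version_a
    merged_clock = {
        n: max(clock_a.get(n, 0), clock_b.get(n, 0))
        for n in set(clock_a) | set(clock_b)
    }
    return merged_clock, cart_a | cart_b


if __name__ == "__main__":
    on_node1 = ({"n1": 2}, {"book", "lamp"})
    on_node2 = ({"n1": 1, "n2": 1}, {"book", "pen"})
    print(reconcile(on_node1, on_node2))
```

A union merge keeps every added item, at the cost that a concurrently deleted item can reappear; that is a known trade-off of this style of semantic reconciliation.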
  • 59.
    Eventual Consistency & AMZ - SimpleDB • SimpleDB strong consistency semantics [49,50] o Until Feb 2010, SimpleDB only supported eventual consistency, i.e. GetAttributes after PutAttributes might not return the same value for some time (1 second) o On Feb 24, AWS added ConsistentRead=True attribute for read o Read will reflect all writes that got 200 OK till that time!
  • 60.
    Eventual Consistency & AMZ - SimpleDB • SimpleDB strong consistency semantics [49,50] o Also added conditional put/delete o Put only if an attribute has a specified value (Expected.1.Value=) or exists/does not exist (Expected.1.Exists = true/false) o Same conditional check capability for delete also o Only on one attribute!
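A conditional put is what makes a safe counter possible without locks. The sketch below models the idea on an in-memory dict; it does not call SimpleDB, and `conditional_put`, `expected_value`, and `expected_exists` are hypothetical stand-ins for the Expected.1.Value and Expected.1.Exists checks named on the slide. As an additional simplification, the condition is checked on the same attribute that is being written.

```python
_UNSET = object()  # sentinel meaning "no expected value supplied"


def conditional_put(item, name, value, expected_value=_UNSET, expected_exists=None):
    """Write item[name] only if the single-attribute condition holds."""
    if expected_value is not _UNSET and item.get(name, _UNSET) != expected_value:
        return False
    if expected_exists is not None and (name in item) != expected_exists:
        return False
    item[name] = value
    return True


def increment(item, name):
    """Optimistic counter: read, then write only if nobody changed it meanwhile."""
    while True:
        current = item.get(name)
        if current is None:
            ok = conditional_put(item, name, 1, expected_exists=False)
        else:
            ok = conditional_put(item, name, current + 1, expected_value=current)
        if ok:
            return item[name]
        # Condition failed: another writer got in first, so re-read and retry.


if __name__ == "__main__":
    item = {}
    print(increment(item, "visits"))
    print(increment(item, "visits"))
```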
  • 61.
    Eventual Consistency & AMZ – S3 • S3 is an eventual consistency system o Versioning o “S3 PUT & COPY synchronously store data across multiple facilities before returning SUCCESS” o Repair lost redundancy, repair bit-rot o Reduced Redundancy option for data that can be reproduced (99.999999999% vs. 99.99%) • Approx 1/3rd less o CloudFront for caching
  • 62.
    !SQL ? • “We conclude that the current RDBMS code lines, while attempting to be a “one size fits all” solution, in fact, excel at nothing. Hence, they are 25 year old legacy code lines that should be retired in favor of a collection of “from scratch” specialized engines.”[43] • “Current systems were built in an era where resources were incredibly expensive, and every computing system was watched over by a collection of wizards in white lab coats, responsible for the care, feeding, tuning and optimization of the system. In that era, computers were expensive and people were cheap” • “The 1970 - 1985 period was a time of intense debate, a myriad of ideas, & considerable upheaval. We predict the next fifteen years will have the same feel”
  • 63.
    Further deliberation • Daniel Abadi[45], Mike Stonebraker[46], James Hamilton[47], Pat Helland[48] are all good reads for further deliberation
  • 64.
    NOSQL Internals & Algorithmics
  • 65.
    Caveats • A representative subset of the mechanics and mechanisms used in the NOSQL world • Being refined & newer ones are being tried • At a system level – to show how the techniques play a part to deliver a capability • The NOSQL Papers and other references for further deliberation • Even if we don’t cover fully, it is OK. I want to introduce some of the concepts so that you get an appreciation …
  • 66.
    NOSQL Mechanics • Horizontal Scalability – Gossip (Cluster membership) – Failure Detection – Consistent Hashing – Replication Techniques • Hinted Handoff • Merkle Trees – Sharding MongoDB – Regions in HBase • Performance – SStables/memtables – LSM w/Bloom Filter • Integrity/Version reconciliation – Timestamps – Vector Clocks – MVCC – Semantic vs. syntactic reconciliation
  • 67.
    Consistent Hashing • Origin: web caching, “To decrease ‘hot spots’” • Three goals[87] – Smooth evolution • When a new machine joins, minimum rebalance work and impact – Spread • Objects assigned to a min number of nodes – Load • # of distinct objects assigned to a node is small
  • 68.
    Consistent Hashing • Hash Keyspace/Token is divided into partitions/ranges • Cassandra – choice – OrderPreserving partitioner – key = token (for range queries) – Also saw a CollatingOrderPreservingPartitioner • Partitions assigned to nodes that are logically arranged in a circle topology • Amz (dynamo) – assign sets of (random) multiple points to different machines depending on load • Cassandra – monitor load & distribute • Specific join & leave protocols • Replication – next 3 consecutive • Cassandra – Rack-aware, Datacenter-aware
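The ring described on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Dynamo or Cassandra code: the `Ring` class, the MD5-based `token` function, and the fixed number of points per node (`points_per_node`) are choices made for the example. Each node owns several points on the circle, and a key is replicated on the next distinct nodes clockwise from its token.

```python
import bisect
import hashlib


def token(key):
    """Map a string onto the ring (MD5 is used only as a convenient stable hash)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, points_per_node=8):
        # Every node gets several points so that load spreads more evenly.
        # Assumes at least one node.
        self._points = sorted(
            (token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self._tokens = [t for t, _ in self._points]

    def replicas(self, key, n=3):
        """Return the first n distinct nodes clockwise from the key's token."""
        start = bisect.bisect_right(self._tokens, token(key)) % len(self._points)
        chosen = []
        for step in range(len(self._points)):
            node = self._points[(start + step) % len(self._points)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == n:
                    break
        return chosen


if __name__ == "__main__":
    ring = Ring(["a", "b", "c", "d"])
    print(ring.replicas("user:42"))
```

When a node joins, only the keys whose ranges its new points take over move to it; every other key keeps its owner, which is the "smooth evolution" goal from the previous slide.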
  • 69.
    Consistent Hashing - Hinted-handoff • What happens when a node is not available? – May be under load – May be network partition • Sloppy Quorum & Hinted-handoff • R/W performed on the 1st n healthy nodes • Replica sent to a host node with hint in metadata & then transferred when the actual node is up • Burdens neighboring nodes • Cassandra 0.6.2 default is disabled (I think)
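The sloppy-quorum write and the later handoff can be sketched with plain dictionaries. All names here (`sloppy_write`, `hand_off`, the `stores` and `hints` dicts) are hypothetical, and the model ignores timeouts, acknowledgements, and read repair.

```python
def sloppy_write(preference, up, n, key, value, stores, hints):
    """Write to the first n healthy nodes of the preference list.

    If an intended replica is down, the write goes to the next healthy node
    further down the list, tagged with a hint naming the real owner.
    """
    intended = preference[:n]
    standbys = [node for node in preference[n:] if node in up]
    for node in intended:
        if node in up:
            stores[node][key] = value
        elif standbys:
            holder = standbys.pop(0)
            hints[holder].append((node, key, value))


def hand_off(holder, recovered, stores, hints):
    """When the real owner is back, deliver its hinted writes and drop the hints."""
    keep = []
    for owner, key, value in hints[holder]:
        if owner == recovered:
            stores[owner][key] = value
        else:
            keep.append((owner, key, value))
    hints[holder] = keep


if __name__ == "__main__":
    nodes = ["a", "b", "c", "d"]
    stores = {node: {} for node in nodes}
    hints = {node: [] for node in nodes}
    sloppy_write(nodes, {"a", "c", "d"}, 3, "k", "v", stores, hints)  # b is down
    print(hints)
    hand_off("d", "b", stores, hints)
    print(stores["b"])
```

The extra work lands on the standby node, which is the "burdens neighboring nodes" cost noted on the slide.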
  • 70.
    Consistent Hashing - Replication • What happens when a new node joins? – It gets one or more partitions – Dynamo: Copy the whole partition – Cassandra: Replicate keyset – Cassandra: working on a bit torrent type protocol to copy from replicas
  • 71.
    Anti-entropy • Merge and reconciliation operations – Operate on two states and return a new state[86] • Merkle Trees – Dynamo uses Merkle trees to detect inconsistencies between replicas – AntiEntropy in Cassandra exchanges Merkle trees and if they disagree, range repair via compaction[91,92] – Cassandra uses the Scuttlebutt Reconciliation[86]
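How a Merkle tree narrows down the ranges that differ between two replicas can be sketched as below. This is an illustration only: it assumes the number of key ranges is a power of two, represents each range as a byte string, and uses SHA-256; `build_levels` and `diff_leaves` are names invented for the example.

```python
import hashlib


def _h(data):
    return hashlib.sha256(data).digest()


def build_levels(ranges):
    """Return the tree as a list of levels: leaf hashes first, root last.

    ranges: list of bytes, one per key range; length must be a power of two.
    """
    levels = [[_h(r) for r in ranges]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([_h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def diff_leaves(a, b):
    """Walk down from the root, following only subtrees whose hashes differ."""
    top = len(a) - 1
    if a[top][0] == b[top][0]:
        return []  # identical roots: replicas agree, nothing to transfer
    suspects = [0]
    for level in range(top - 1, -1, -1):
        next_suspects = []
        for parent in suspects:
            for child in (2 * parent, 2 * parent + 1):
                if a[level][child] != b[level][child]:
                    next_suspects.append(child)
        suspects = next_suspects
    return suspects


if __name__ == "__main__":
    replica_1 = build_levels([b"r0", b"r1", b"r2", b"r3"])
    replica_2 = build_levels([b"r0", b"r1", b"changed", b"r3"])
    print(diff_leaves(replica_1, replica_2))
```

Only the ranges returned by `diff_leaves` need to be repaired, so two replicas that mostly agree exchange very little data.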
  • 72.
    Gossip • Membership & Failure detection • Based on emergence without rigidity – pulse coupled oscillators, biological systems like fireflies![90] • Also used for state propagation – Used in Dynamo/Cassandra
  • 73.
    Gossip • Cassandra exchanges heartbeat state, application state and so forth • Every second, a node picks a random live node and a random unreachable node and exchanges key-value structures • Some nodes play the part of seeds • Seed/initial contact points in a static conf file (storage.conf) • Could also come from a configuration service like zookeeper • To guard against node flap, explicit membership join and leave – now you know why hinted handoff was added
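One round of heartbeat gossip can be sketched as follows. This is a toy model, not Cassandra's gossiper: each node keeps a dict of the highest heartbeat version it has seen per node, and `merge` and `gossip_round` are hypothetical names. Seeds, unreachable nodes, and application state are left out.

```python
import random


def merge(local, remote):
    """Keep the highest heartbeat version seen for every node."""
    for node, version in remote.items():
        if version > local.get(node, -1):
            local[node] = version


def gossip_round(states, rng):
    """Each node bumps its own heartbeat, picks one random peer, and the two
    exchange and merge their views. Needs at least two nodes."""
    for node in states:
        states[node][node] += 1
        peer = rng.choice([p for p in states if p != node])
        merge(states[peer], states[node])
        merge(states[node], states[peer])


if __name__ == "__main__":
    states = {name: {name: 0} for name in ("a", "b", "c", "d")}
    rng = random.Random(7)
    for _ in range(5):
        gossip_round(states, rng)
    print(states["a"])
```

Because views only ever move forward, repeated rounds spread membership and heartbeat information through the cluster without any central coordinator.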
  • 74.
    Membership & Failure detection • Consensus & Atomic Broadcast - impossible to solve in an asynchronous distributed system[88,89] – Cannot differentiate between a slow system and a crashed system • Completeness – Every system that crashed will be eventually detected • Correctness – A correct process is never suspected • In short, if you are dead somebody will notice it and if you are alive, nobody will mistake you for dead!
  • 75.
    Ø Accrual Failure Detector • Not a Boolean value but a probabilistic number that “accrues” over an exponential scale • Captures the degree of confidence that a corresponding monitored process has crashed[94] – Suspicion Level – Ø = 1 -> prob(error) 10% – Ø = 2 -> prob(error) 1% – Ø = 3 -> prob(error) 0.1% • If process is dead, – Ø is monotonically increasing & Ø→∞ as t→∞ • If process is alive and kicking, Ø=0 • Account for lost messages, network latency and actual crash of system/process • Well known heartbeat period Δi, then network latency Δtr can be tracked by inter-arrival time modeling
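The suspicion level Ø (phi) is defined as -log10 of the probability that a heartbeat would still arrive after the time already waited, which is why Ø = 1, 2, 3 correspond to error probabilities of 10%, 1%, 0.1%. The sketch below makes one simplifying assumption that should be read as such: inter-arrival times are modeled as exponentially distributed, so P(later than t) = exp(-t / mean) and Ø = t / (mean · ln 10). The class and method names are hypothetical.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Accrual failure detector sketch using an exponential inter-arrival model
    (a simplification chosen for this example)."""

    def __init__(self, window=1000):
        self.intervals = deque(maxlen=window)  # recent heartbeat inter-arrival times
        self.last = None

    def heartbeat(self, now):
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self, now):
        if not self.intervals:
            return 0.0  # not enough history to suspect anything yet
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last
        # phi = -log10(exp(-elapsed / mean)) = elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))


if __name__ == "__main__":
    detector = PhiAccrualDetector()
    for t in (0.0, 1.0, 2.0, 3.0):
        detector.heartbeat(t)
    for now in (3.0, 5.0, 10.0):
        print(now, detector.phi(now))
```

The application picks a threshold on Ø instead of a fixed timeout, so the same detector can serve callers with different tolerance for false suspicion.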
  • 76.
    Write/Read Mechanisms • Read & Write to a random node (StorageProxy) • Proxy coordinates the read and write strategy (R/W = any, quorum et al) • Memtables/SSTables from Bigtable • Bloom Filter/Index • LSM Trees
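The "quorum" strategy mentioned above rests on simple arithmetic: with N replicas, a read of R replicas is guaranteed to overlap a write acknowledged by W replicas whenever R + W > N. A short sketch, with function names chosen for the example:

```python
def majority(n):
    """Smallest number of replicas that forms a majority of n."""
    return n // 2 + 1


def read_overlaps_write(n, r, w):
    """True when every read set of size r intersects every write set of size w."""
    return r + w > n


if __name__ == "__main__":
    n = 3
    q = majority(n)
    print(q, read_overlaps_write(n, q, q))   # quorum reads and quorum writes
    print(read_overlaps_write(n, 1, 1))      # single-replica reads and writes
```

Lower R and W give lower latency and higher availability, at the price of reads that may miss the latest write.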
  • 77.
    [Diagram: write/read path on a node. Labels: Write, Commit Logs, Memory (MemTable), Flushing, Read, Disk (SSTables, each with an Index and a Bloom Filter, BF). HBase equivalents: WAL, Memstore, HDFS file system.] SSTable • Immutable • Compaction • Maintain Index & Bloom Filter
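The write and read path on this slide can be sketched as a tiny in-memory model. `TinyLSM` is a hypothetical class written for illustration: the commit log is a list, each "SSTable" is a sorted dict that is never modified after it is written, and indexes and Bloom filters are omitted.

```python
class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # appended first, so a crash can be replayed
        self.memtable = {}     # in-memory, mutable
        self.sstables = []     # flushed tables, oldest first; never modified
        self.limit = memtable_limit

    def put(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        """Write the memtable out as a new sorted table and start afresh."""
        if self.memtable:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}
            self.commit_log.clear()

    def get(self, key):
        """Check memory first, then tables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None

    def compact(self):
        """Merge all tables into one; newer values win over older ones."""
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [dict(sorted(merged.items()))]


if __name__ == "__main__":
    db = TinyLSM()
    for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
        db.put(key, value)
    print(len(db.sstables), db.get("a"))
    db.compact()
    print(len(db.sstables), db.get("a"))
```

Writes never update a table in place, which keeps them sequential; compaction pays that cost back later by merging tables and dropping superseded values.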
  • 78.
    How…  does  HBase  work  again?   http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html http://hbaseblog.com/2010/07/04/hug11-hbase-0-90-preview-wrap-up/
  • 79.
    Bloom Filter • The BloomFilter answers the question • “Might there be data for this key in this SSTable?” [Ref: Cassandra/Hbase mailer] – “Maybe” or – “Definitely not” – When the BloomFilter says “maybe” we have to go to disk to check out the content of the SSTable • Depends on implementation – Redone in Cassandra – Hbase 0.20.x removed, will be back in 0.90 with a “jazzy” implementation
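The "maybe" versus "definitely not" behaviour can be shown with a small Bloom filter. This is a generic sketch, not the Cassandra or HBase implementation; the bit-array size, the number of hash functions, and the use of salted SHA-256 are arbitrary choices made for the example.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive several bit positions per key by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        """False means 'definitely not'; True means 'maybe, go check the SSTable'."""
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


if __name__ == "__main__":
    bf = BloomFilter()
    for key in ("row1", "row2", "row3"):
        bf.add(key)
    print(bf.might_contain("row2"))
    print(bf.might_contain("row999"))
```

A key that was added always answers "maybe"; a key that was never added usually answers "definitely not", with a false-positive rate that depends on the filter size and the number of keys.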
  • 80.
    Was it a vision, or a waking dream? Fled is that music:—do I wake or sleep? -Keats, Ode to a Nightingale
  • 81.
    • http://www.readwriteweb.com/enterprise/2011/11/infographic-data-deluge---8-ze.php • http://www.crn.com/news/data-center/232200061/efficiency-or-bust-data-centers-drive-for-low-power-solutions-prompts-channel-growth.htm • http://www.quantumforest.com/2011/11/do-we-need-to-deal-with-big-data-in-r/ • http://www.forbes.com/special-report/2011/migration.html • http://www.mercurynews.com/bay-area-news/ci_19368103 • http://www.businessinsider.com/apple-new-data-center-north-carolina-created-50-jobs-2011-11