Exalead managing terabytes

Published in: Technology

  1. Content
     • Introduction
     • Databases
       – ACID
       – Data structures, algorithms
       – Scalability issues
       – Scaling patterns
     • Search engines
       – Data structures, algorithms
       – Pros & cons
     • NoSQL Movement
       – Why and What
  2. Content
     • NoSQL Families
       – Key-value stores
       – Column stores
       – Document stores
       – Graph DB
     • Principles: CAP, scaling patterns, high availability patterns, elasticity
     • How to choose?
     • Conclusion
  3. Introduction
     • Who we are:
       – Clément STENAC (Indexing and search techs)
       – Jérémie BORDIER (360 team: a bit of everything)
     • Exalead:
       – Indexing technologies provider since 1998
       – Online search engine: http://www.exalead.com
       – Daily challenge: tackle information access problems for large companies.
  4. Introduction
     • Universal answer to data storage: RELATIONAL DATABASES
     • Well-known data representation: objects and relationships
     • Powerful query language: SQL
     • Open source implementations:
       – MySQL
       – PostgreSQL
       – …
  5. Introduction
     • Database scalability problems?
     • Used to be a telco and bank problem…
     • Until the internet came along!
     (Image: Twitter whale, 2008)
  6. Introduction
     • Thanks to the internet…
     • …millions of rows are common…
     • …real-time websites.
     How to deal with massive amounts of structured data? Are there alternatives? What's this NoSQL buzz?
  7. Knowing your enemy: RELATIONAL DATABASES
  8. Databases: ACID
     ACID constraints
     • Atomicity
       – Transactions succeed or fail atomically
     • Consistency
       – Transactions leave the database in a consistent state
     • Isolation
       – Transactions do not see the effects of concurrent transactions
     • Durability
       – Once a transaction is committed, it can't be lost
  9. Database structures: Primary storage
     CREATE TABLE author (
         id INTEGER PRIMARY KEY,
         nick VARCHAR(16),
         age INTEGER,
         firstname VARCHAR(128),
         biography TEXT);
     CREATE TABLE post (
         id INTEGER PRIMARY KEY,
         author_id INTEGER REFERENCES author(id),
         timestamp TIMESTAMP,
         title VARCHAR(256),
         text TEXT);
     • Fixed size: id, age, nick are stored inline (heuristics may change a column to variable-size)
     • Variable size: firstname, biography are stored outside the row and reached through a pointer
     • Each value or pointer can be retrieved at a known offset in the row
     Row 1: | id (4 bytes) | age (4 bytes) | nick (16 bytes) | firstname (pointer) | biography (pointer) |
     Row 2: | id (4 bytes) | age (4 bytes) | nick (16 bytes) | firstname (pointer) | biography (pointer) |
     Table strings: | len | data | len | data | len | data | len | data |
  10. Searching in a database
     SELECT * FROM author WHERE age=24;
     The raw way: full scan
     • Enumerate all records in the table
     • For each record, fetch the condition value
       – Inline value: direct access at row_address + offset(column)
       – Outside value: fetch pointer and fetch data
     • Perform comparison
     Analysis
     • Need to analyse the full table
     • Very CPU intensive
     • If the table does not fit in memory: I/O on the whole table
  11. Database structures: Indexes
     What is an index?
     • Primary storage: forward mapping, row_id -> row data
     • Index: reverse mapping, row data -> row_id(s)
     • Updated together with the primary storage
     Searching with an index
     • Retrieve the row ids using the index
     • Fetch the row data from primary storage
  12. Database structures: Indexes – Hash index
     How it works
     • Stores hashes of column values in a hash table
     • Retrieve through the hash table
     Pros
     • Very easy and fast to update
     • Fast lookup: a single hash table lookup
     Cons
     • Only provides equality matching
     • Unable to answer inequality queries
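As a rough illustration of slides 10 to 12, the sketch below contrasts a full scan with a hash index lookup for `age = 24`. The sample rows and the helper names (`full_scan`, `build_hash_index`, `index_lookup`) are made up for this example; a real database stores rows on disk pages rather than in Python dictionaries.

```python
from collections import defaultdict

# Primary storage: forward mapping row_id -> row data (made-up sample rows).
authors = {
    1: {"nick": "ann", "age": 24},
    2: {"nick": "bob", "age": 42},
    3: {"nick": "cid", "age": 24},
}

def full_scan(table, column, value):
    """Examine every row and compare the column value."""
    return sorted(rid for rid, row in table.items() if row[column] == value)

def build_hash_index(table, column):
    """Reverse mapping: column value -> list of row ids."""
    index = defaultdict(list)
    for rid, row in table.items():
        index[row[column]].append(rid)
    return index

def index_lookup(index, value):
    """Single hash-table lookup; only equality can be answered this way."""
    return sorted(index.get(value, []))

age_index = build_hash_index(authors, "age")
scan_hits = full_scan(authors, "age", 24)
index_hits = index_lookup(age_index, 24)
```

Both paths return the same row ids; the difference is that the scan touches every row, while the index touches one hash bucket and must be updated on every write.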
  13. Database structures: Indexes – BTree index
     (Figures: binary search tree, B-Tree)
     Pros
     • Provides range and inequality queries easily
     • Quite fast (logarithmic) operations
     Cons
     • More complex and expensive to update
     • B-Tree rebalancing
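The point about range queries can be sketched with a sorted list and binary search. This is not a B-Tree (there are no pages and no rebalancing); it only assumes that the ordered `(age, row_id)` pairs behave like the ordered leaves of one. The data and the `range_lookup` helper are illustrative.

```python
import bisect

# Sorted (age, row_id) pairs stand in for the ordered leaves of a B-Tree.
age_entries = sorted([(24, 1), (42, 2), (24, 3), (31, 4), (57, 5)])

def range_lookup(entries, low, high):
    """Return row ids whose key satisfies low <= key < high."""
    # (low,) sorts before every (low, row_id) pair, so bisect_left
    # lands on the first entry whose key is >= low.
    start = bisect.bisect_left(entries, (low,))
    end = bisect.bisect_left(entries, (high,))
    return [row_id for _, row_id in entries[start:end]]

under_40 = range_lookup(age_entries, 0, 40)
```

Two logarithmic searches locate the boundaries, then the matching entries are read sequentially, which is what a hash index cannot offer.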
  14. Choosing how to search
     Is indexed search always better?
     • SELECT * FROM author WHERE age < 300;
     Analysis
     • Fetch of whole table
     • Index: random lookups
     • Full scan: sequential fetch
     Choosing wisely
     • Identify the expensive queries
     • Use the EXPLAIN statement
     • Only add indexes where they are required
     • Indexes are expensive to update
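One way to try the EXPLAIN advice without a server is SQLite's `EXPLAIN QUERY PLAN`, used below through Python's standard `sqlite3` module. SQLite is an assumption made for this sketch (the slides mention MySQL and PostgreSQL, whose EXPLAIN output looks different), and the index name `idx_author_age` is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE author (id INTEGER PRIMARY KEY, nick VARCHAR(16), age INTEGER)"
)
conn.executemany(
    "INSERT INTO author (nick, age) VALUES (?, ?)",
    [("u%d" % i, 20 + i % 50) for i in range(1000)],
)

def plan(sql):
    """Return the human-readable plan; the detail text is the last column."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

query = "SELECT * FROM author WHERE age = 24"
plan_without_index = plan(query)   # planner has to scan the table
conn.execute("CREATE INDEX idx_author_age ON author(age)")
plan_with_index = plan(query)      # planner can now use the index
```

Comparing the two plan strings shows whether the planner picked the index; the exact wording of the plan depends on the SQLite version.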
  15. Joining
     Goal
     • Put together data from several tables
     • For some values in table A, find matching values in table B
     Example
     • SELECT * FROM post
       INNER JOIN author ON author.id = post.author_id
       WHERE author.age = 42;
  16. Join algorithms
     Nested loops
     • Foreach (author WHERE age=42) {
         Foreach (post) {
           if (post.author_id == author.id) {
             append post to the result set;
           }
         }
       }
     • Very naive algorithm: runs in P x A time (P posts, A matching authors)
     • Supports all join predicates
     Hash join
     • Algorithm
       – Make a hashtable of author ids matching the « age = 42 » condition
       – Scan the post table once
       – For each post, look up in the hashtable to check if it matches a valid author
     • Faster than nested loops (2 scans instead of A)
     • Requires memory to store the hashtable
     • Only supports equality predicates
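The two algorithms on this slide can be sketched over in-memory lists. The sample rows and function names are illustrative; a real engine works on disk pages and may spill the hash table to disk.

```python
authors = [
    {"id": 1, "nick": "ann", "age": 42},
    {"id": 2, "nick": "bob", "age": 30},
    {"id": 3, "nick": "cid", "age": 42},
]
posts = [
    {"id": 10, "author_id": 1, "title": "first"},
    {"id": 11, "author_id": 2, "title": "second"},
    {"id": 12, "author_id": 3, "title": "third"},
    {"id": 13, "author_id": 1, "title": "fourth"},
]

def nested_loop_join(authors, posts, age):
    """For each matching author, scan the whole post list."""
    result = []
    for author in authors:
        if author["age"] != age:
            continue
        for post in posts:  # one full pass over posts per matching author
            if post["author_id"] == author["id"]:
                result.append(post["id"])
    return result

def hash_join(authors, posts, age):
    """Build a hash set of matching author ids, then probe it once per post."""
    wanted = {a["id"] for a in authors if a["age"] == age}       # build phase
    return [p["id"] for p in posts if p["author_id"] in wanted]  # probe phase

nested = nested_loop_join(authors, posts, 42)
hashed = hash_join(authors, posts, 42)
```

Both return the same set of posts, possibly in a different order; the hash join reads each input once but needs memory for `wanted`.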
  17. Join algorithms
     Merge join
     • Needs both tables sorted by join key
       – Post sorted by author_id
       – Author sorted by id
     • Perform a single parallel scan of the two tables and identify matches
     • Fastest algorithm, but needs sorted data
     • Disk-based sort for large data sets
     Choice of join algorithm
     • Performed automatically by the query optimizer (EXPLAIN)
     • Main parameters:
       – Relation cardinalities
       – Data order (presence of an ORDER BY clause?)
       – Available indexes
     • JOINs are always expensive -> schema denormalization
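A minimal merge join sketch, assuming both inputs are already sorted on the join key and that author ids are unique (as a primary key would be). The sample data is made up.

```python
def merge_join(authors, posts):
    """authors sorted by id (unique), posts sorted by author_id.
    Returns (author_id, post_id) pairs found in a single parallel pass."""
    result = []
    i = j = 0
    while i < len(authors) and j < len(posts):
        a_id = authors[i]["id"]
        p_key = posts[j]["author_id"]
        if a_id < p_key:
            i += 1          # this author has no more posts
        elif a_id > p_key:
            j += 1          # this post has no matching author
        else:
            result.append((a_id, posts[j]["id"]))
            j += 1          # author ids are unique, so only advance posts
    return result

authors_sorted = [{"id": 1}, {"id": 3}, {"id": 4}]
posts_sorted = [
    {"id": 10, "author_id": 1},
    {"id": 13, "author_id": 1},
    {"id": 11, "author_id": 2},
    {"id": 12, "author_id": 3},
]
pairs = merge_join(authors_sorted, posts_sorted)
```

Each cursor only moves forward, so the cost is one pass over each table once the sort has been paid for.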
  18. Database scaling: Typical workloads
     Mostly read workloads
     • Example: Wikipedia
     • First solution: high-level (frontend tier) caching
     • Database scaling: 1 master – N slaves
     • Replication of changes from master to slaves
     • Does not solve the write bottleneck problem
     High write workloads
     • Examples: credit cards, Twitter (>1000 tweets/second, 1000s of deliveries)
     • Performance limited by write I/O throughput
     • Because of the « D » constraint
     • Hard to have more than 1000-2000 writes/second
  19. Database scaling: Scaling writes
     Multiple master setups
     • All masters have the same data and share the updates
     • « share-all » cluster architecture
     • Extremely complex synchronization
       – Bi-directional replication
       – Conflict detection
       – Bad performance
     • Complex resilience
       – Downtime of a master: need a resync
     • Complex, heavy and expensive architectures
     (Diagram: Client 1, Client 2, Master 1 and Master 2, with a bi-directional replication flow between the two masters)
  20. Database scaling: Scaling writes
     Sharding
     • Split the data between the masters based on a criterion
       – Date
       – User id
       – hash(url), …
     • Clients query the correct master for each piece of data
     • No shared data between masters (« share-nothing »)
     (Diagram: Client 1, Client 2, Master 1, Master 2)
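Client-side routing for the hash(url) criterion can be sketched as follows. The shard names and the use of MD5 are assumptions made for illustration; any stable hash works, but Python's built-in `hash()` is randomized per process for strings, so it is avoided here.

```python
import hashlib

SHARDS = ["master-1", "master-2"]  # hypothetical shard names

def shard_for(key, shards=SHARDS):
    """Pick the master responsible for a key with a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]

target = shard_for("http://www.exalead.com")
```

Every client must apply the same function, which is why the slide that follows lists application-level sharding as a source of complexity.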
  21. Database scaling: Problems with SQL sharding
     Complexity
     • Not integrated in SQL
     • Need to perform the sharding in application code
     Resilience
     • Several machines but no resilience
     • Loss of one master = loss of data (compare to RAID-0)
     Loss of features
     • You can't do cross-shard joins
     Complex evolutions
     • How do you keep scaling?
     • To add another machine, you need to change the distribution function
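The last point can be made concrete with a small experiment: with a plain modulo distribution function, going from 2 to 3 shards relocates most keys. The key names and the MD5-based hash are assumptions for this sketch.

```python
import hashlib

def shard_index(key, shard_count):
    """Modulo-based distribution function over a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

keys = ["user-%d" % i for i in range(10000)]

# A key stays put only if it maps to the same index under both shard counts.
moved = sum(1 for k in keys if shard_index(k, 2) != shard_index(k, 3))
moved_fraction = moved / len(keys)
```

For a uniform hash, about two thirds of the keys are expected to change shard here, which means a large data migration for a single added machine.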
  22. Database scaling: Other SQL shortcomings
     Strict schema
     • It is good, it provides strong typing
     • But, migration hell!
     • Web applications change quickly
     • Not « Agile »
  23. On the other side: SEARCH ENGINES
  24. A quick look at search engines
     Differences from a traditional database
     • Not designed for OLTP
     • Update by batches
     • No transactions, updates are available to readers « later »
     • Heavily read-optimized
     Full text search
     • It's more complex than LIKE '%myword%';
     • Needs specific data structures
  25. Search engines: Inverted lists
     What it is
     • A data structure mapping a « word identifier » to a list of « document identifiers »
     • For each word of each document, store the positions
     Documents
     • Document 1: The quick fox
     • Document 2: The lazy dog
     • Document 3: The dog quick dog
     Dictionary
     • the = 1
     • quick = 2
     • fox = 3
     • lazy = 4
     • dog = 5
     Inverted lists
     • List for word 1 (the): doc 1 (at position 0), doc 2 (at position 0), doc 3 (at position 0)
     • List for word 2 (quick): doc 1 (at position 1), doc 3 (at position 2)
     • List for word 3 (fox): doc 1 (at position 2)
     • List for word 4 (lazy): doc 2 (at position 1)
     • List for word 5 (dog): doc 2 (at position 2), doc 3 (at positions 1, 3)
     Exalead S.A. © 2010 CONFIDENTIAL
  26. Search engines: Searching with inverted lists
     Single word query: dog
     • Resolve the word to its id using the dictionary (wid 5)
     • Fetch the inverted list for this id and simply read it
     • We have the hits: document 2 and document 3
     Boolean query: the AND dog
     • Resolve words, fetch inverted lists
     • The: 1, 2, 3   Dog: 2, 3
     • Perform intersection: hits = 2, 3
     Boolean query: the OR dog
     • Resolve/fetch
     • Perform union: hits = 1, 2, 3
  27. Search engines: Searching with inverted lists
     Positional query: the NEXT dog
     • Fetch the inverted lists and also read the positions
     • The: 1(0), 2(0), 3(0)   Dog: 2(2), 3(1,3)
     • Identify "simple boolean" matches: docs 2 and 3
     • For each possible match, check if positions form a sequence
     • Only document 3 matches on sequence (0,1)
     • Positional queries are more expensive and storing word positions is expensive (disk space, decoding CPU, I/O)
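The three example documents and the AND, OR and NEXT queries from slides 25 to 27 can be reproduced with a small in-memory sketch. For brevity it keys the lists by the word itself rather than by a numeric word id, and the function names are made up.

```python
from collections import defaultdict

docs = {1: "The quick fox", 2: "The lazy dog", 3: "The dog quick dog"}

def build_index(docs):
    """word -> {doc_id: [positions]} (positions start at 0)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def q_and(index, a, b):
    """Boolean AND: intersection of the two document lists."""
    return sorted(set(index.get(a, {})) & set(index.get(b, {})))

def q_or(index, a, b):
    """Boolean OR: union of the two document lists."""
    return sorted(set(index.get(a, {})) | set(index.get(b, {})))

def q_next(index, a, b):
    """Positional query: b must appear right after a in the same document."""
    hits = []
    for doc_id in q_and(index, a, b):  # boolean candidates first
        positions_b = set(index[b][doc_id])
        if any(p + 1 in positions_b for p in index[a][doc_id]):
            hits.append(doc_id)
    return hits

index = build_index(docs)
```

The positional query reuses the boolean intersection and then pays the extra cost of decoding positions for each candidate document.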
  28. The revolution: THE NOSQL MOVEMENT
  29. NoSQL Movement
     • « NoSQL » © Eric EVANS (Rackspace, 2009)
     "The name was an attempt to describe the emergence of a growing number of non-relational, distributed data stores that often did not attempt to provide ACID guarantees." (Wikipedia)
  30. NoSQL Movement: Issue
     • RDBMS fail with huge amounts of data
       – Facebook's 70TB of inbox
       – Digg's 3TB
       – eBay's 2PB…
     • High-scale SQL systems are either:
       – Very expensive to buy and quite hard to maintain
       – Very expensive to maintain
  31. NoSQL Movement
     • We need new systems that:
       – Scale horizontally (both read/write)
       – Have no single point of failure
       – Are fault tolerant
       – Are elastic (adding nodes is easy)
       – Have flexible data schemas
       – Are more web-application friendly
  32. NoSQL: Families
     • Different types of data stores:
       – Key-Value stores (Dynamo, Redis, Voldemort…)
       – Column stores (BigTable, Cassandra, HBase…)
       – Document stores (CouchDB, MongoDB…)
       – Graph stores (Neo4J, Swarm…)
  33. NoSQL: Key-Value stores
     • Distributed hashtables
       – Btrees
       – Fixed-size tables
     • Benefits:
       – Very simple API (get/put/delete/range)
       – Easily shardable
       – Fast reads
     • Drawbacks:
       – No data schema (no joins, data flattening…)
       – No query language
     • Implementations: Redis, Amazon Dynamo, Voldemort
  34. NoSQL: Column Stores
     | Id | Lastname | Firstname | Salary |
     | 1  | Smith    | Joe       | 40000  |
     | 2  | Jones    | Mary      | 50000  |
     | 3  | Johnson  | Cathy     | 44000  |
     • Row-based storage:
       – 1,Smith,Joe,40000;2,Jones,Mary,50000;3,Johnson,Cathy,44000;
     • Column-based storage:
       – 1,2,3;Smith,Jones,Johnson;Joe,Mary,Cathy;40000,50000,44000;
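The two layouts can be mimicked in memory using the slide's own table. The `columns` dictionary and `fetch_record` helper are illustrative; an actual column store keeps each column in its own on-disk segment.

```python
names = ("id", "lastname", "firstname", "salary")

# Row-based layout: one tuple per record.
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]

# Column-based layout: one list per column.
columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}

# An aggregate only needs to read a single column.
total_salary = sum(columns["salary"])

def fetch_record(columns, position):
    """Rebuilding one record has to touch every column."""
    return tuple(columns[name][position] for name in names)

second_record = fetch_record(columns, 1)
```

This is the trade-off behind the benefits and drawbacks listed on the column-store slides: aggregates read one list, single-record access reads all of them.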
  35. NoSQL: Column Stores
     • Benefits:
       – Reading all the values of a given column is faster (e.g. aggregates)
       – Batch writes are faster
     • Joins are faster
       – Comparing two columns is sequential
       – Many more L1 CPU cache hits
       – L1 cache reference: 0.5ns
       – L2 cache reference: 7ns
  36. NoSQL: Column Stores
     • Drawbacks:
       – Reading a single object is slower (multiple I/Os)
       – Writing a single object is slower (multiple I/Os)
       – Doesn't fit most applications
     • Finally:
       – Well suited for heavy write / read applications (e.g. Facebook inbox indexes)
  37. NoSQL: Document Stores
     • Can be seen as a schema-free, hierarchical database (usually represented as JSON)
     SQL schema (Person 1 – N Animal):
       Person:
         - id
         - name
         - address
         - phone
       Animal:
         - id
         - person_id
         - name
         - address
         - phone
     Document store:
       Person:
         - id
         - name
         - address
         - phone
         - animals =
             - id
             - person_id
             - name
             - address
             - phone
  38. NoSQL: Document Stores
     • Benefits:
       – Data locality! Everything in one place
       – Efficient writes and updates (in place)
       – Efficient reads
       – Highly flexible data schema
       – Usually provides indexes over each object key, giving a powerful query language
     • Drawbacks:
       – Doesn't encourage a well-designed data schema
  39. NoSQL: Graph Stores
     • An entry is a node
     • Nodes have properties
     • Edges are links between nodes
  40. NoSQL: Graph Stores
     • Benefits:
       – Faster to fetch an entry and its related entries (links are already resolved, no need to join)
       – Flexible data schema
     • Drawbacks:
       – Complex APIs
       – Slow for batch operations
       – Open source implementations are not that good…
  41. The real issues… SCALABILITY IN PRACTICE
  42. CAP Theorem
     • CAP:
       – Consistency: Operating fully or not at all.
       – Availability: The service must be reachable at any time.
       – Partition Tolerance: No set of failures less than total network failure is allowed to cause the system to respond incorrectly.
     "Any shared-data system can only achieve two of these three." (CAP Theorem, Dr. Eric Brewer, Berkeley, 2000)
  43. Consistent Hashing
     • Ensuring data availability: replication!
     • Reaching the right nodes? Hashing
     • Consistent hashing: hash ring
       – Objects are mapped into a range
       – Nodes are mapped into that range
       – We write the object into the nearest node, clockwise
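A minimal hash ring sketch, assuming MD5 as the hash and one point per node (production systems usually add virtual nodes and replicate to the next nodes on the ring, which is left out here). The `HashRing` class and node names are made up.

```python
import bisect
import hashlib

def ring_position(value):
    """Map a string (object key or node name) into the ring's range."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class HashRing:
    def __init__(self, nodes):
        # Nodes sorted by their position on the ring.
        self._points = sorted((ring_position(n), n) for n in nodes)
        self._positions = [p for p, _ in self._points]

    def node_for(self, key):
        """First node at or after the key's position, wrapping around."""
        i = bisect.bisect_left(self._positions, ring_position(key))
        return self._points[i % len(self._points)][1]

keys = ["object-%d" % i for i in range(1000)]
ring3 = HashRing(["node-a", "node-b", "node-c"])
ring4 = HashRing(["node-a", "node-b", "node-c", "node-d"])

# Keys whose owner changes when node-d joins the ring.
moved = [k for k in keys if ring3.node_for(k) != ring4.node_for(k)]
```

When a node joins, the only keys that change owner are the ones the new node takes over from its clockwise successor, in contrast with the modulo scheme where most keys move.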
  44. Data consistency
     • Ensuring eventual data consistency: quorum writes
       – W = number of writes to ensure before returning OK
       – R = number of reads to ensure
       – N = replication factor
     • W < N == High write availability
       – Data may be lost or outdated if read from another node
     • R < N == High read availability
       – Data may be outdated
     • W + R > N == Full consistency!
       – But slower writes / reads
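The W + R > N rule follows from a counting argument: if the write set and the read set together contain more than N replicas, they must share at least one. The brute-force check below enumerates every possible choice of replicas for small N; the function name is made up.

```python
from itertools import combinations

def always_overlap(n, w, r):
    """True if every set of w written replicas intersects
    every set of r read replicas, among n replicas."""
    nodes = range(n)
    return all(
        set(written) & set(read)
        for written in combinations(nodes, w)
        for read in combinations(nodes, r)
    )

# N = 3 replicas: W = 2, R = 2 always overlaps; W = 1, R = 1 may not.
strict = always_overlap(3, 2, 2)
loose = always_overlap(3, 1, 1)
```

When the sets always overlap, a read is guaranteed to reach at least one replica holding the latest acknowledged write, at the price of waiting for more nodes on each operation.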
  45. Conflict resolution
     • What happens when R > 1 and two different versions are found?
     • Conflict resolution!
     • Common algorithm: vector clocks
  46. Vector clocks
     • Assign a unique ID to each node
     • A node increments its own entry in the vector and keeps track of the old entries
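A small sketch of how vector clocks let a reader tell an outdated version from a true conflict. Clocks are plain dictionaries mapping node id to counter; the `increment` and `compare` helpers and the node ids A, B, C are illustrative.

```python
def increment(clock, node_id):
    """Return a copy of the clock with node_id's own entry bumped."""
    updated = dict(clock)
    updated[node_id] = updated.get(node_id, 0) + 1
    return updated

def compare(a, b):
    """Classify clock a relative to clock b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a is an ancestor of b: a is outdated
    if b_le_a:
        return "after"       # b is an ancestor of a
    return "concurrent"      # neither descends from the other: conflict

base = increment({}, "A")        # written through node A
via_b = increment(base, "B")     # updated through node B
via_c = increment(base, "C")     # updated independently through node C
```

Versions classified as "before" can be discarded silently; "concurrent" versions are the ones that need application-level conflict resolution.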
  47. Elasticity: Gossip Membership
     • When a node joins…
  48. Elasticity: Gossip Membership
     • When a node crashes!
  49. I'm starting the next big startup… WHAT'S THE BEST SYSTEM?
  50. Choosing your storage system
     • "Don't optimize too early"
     • MySQL is robust and works VERY well
       – You'll know where bugs come from (you)
     • Key-Value stores are hype, and often badly implemented
     • Anyway, the most mature "NoSQL" systems:
       – MongoDB
       – Cassandra
  51. Questions?
