Ben	  Coverston	     Director	  of	  Opera2ons	  ben.coverston@datastax.com	           Hosted	  By:	        Ma=hew	  O’Kee...
History	  •    Open	  Sourced	  by	  FB	  in	  July	  2008	  •    Apache	  Incubator	  March	  2009	  •    Graduated	  Mar...
You	  Changed	  Your	  Name?	  Why!?	  •  Suits	      –  Marke2ng	      –  Relevancy	      –  Riptano	  too	  “Skateboard”...
Strengths	  •  Scalable	  •  Reliable	      –  Replica2on	  that	  works	      –  Mul2-­‐DC	  Support	      –  No	  Single...
Weaknesses	  •  No	  ACID	  Transac2ons	  •  Limited	  Support	  for	  (OLTP)	  ad-­‐hoc	  queries	  •  ..but	  you	  lost...
A	  Short	  History	  of	  Big	  Data	  (Or	  Why	                      Cassandra)	  •  Rela2onal	  databases	  scale	  po...
First	  A=empt	  
We	  just	  need	  to	  buy	  a	  bigger	  box…	  
We	  Just	  Need	  to	  Cache	  Our	  Data…	  
Add	  a	  few	  more	  Databases	  
A	  Li=le	  Sharding	  
What	  About	  Backup?	  
Add	  Another	  Layer	  of	  Abstrac2on	  
What	  do	  we	  end	  up	  with?	  (“The	  eBay	  Architecture,”	  Randy	  Shoup	  and	  Dan	  Pritche=)	  
BASE	  •  BASE	  is	  diametrically	  opposed	  to	  ACID.	  Where	     ACID	  is	  pessimis2c	  and	  forces	  consistenc...
Myth	  •  Lack	  of	  ACID	  means	  that	  I	  have	  to	  give	  up	     transac2onal	  guarantees	  and	  consistency.	...
Cassandra	  In	  Produc2on	  •    Nellix	  :	  Streaming	  Bookmarks	  •    Digital	  Reasoning:	  NLP	  &	  En2ty	  Analy...
Who	  is	  inves2ng	  in	  Cassandra?	  •  DataStax	  •  Twi=er:	     –  Were	  inves2ng	  in	  Cassandra	  every	  day.	 ...
Durability	  •  Write	  to	  Commit	  Log	      –  fsync	  is	  cheap	  (append	  only)	      –  Latency	  is	  only	  sub...
Log	  Structured	  Storage	  
Tuneable	  Consistency	  •  One,	  Quorum,	  All	  •  R	  +	  W	  >	  N	  •  Choose	  availability	  vs	  consistency	  (l...
The	  Ring	  
Adding	  A	  Node	  
Adding	  A	  Node	  (Con2nued)	  
Bootstrapping	  
Consistent	  Hashing	  •  Hash	  Func2on	  -­‐-­‐	  K	  à	  T	       –  Let’s	  call	  this	  |k|	  (hash	  of	  k)	  for...
Par22oning	  
Replica2on	  •  Simple	  Replica2on	  Strategy	  •  Network	  Topology	  Strategy	     –  How	  many	  replicas	  in	  eac...
Replica2on	  
Coordinators	  •  Each	  Node	  can	  be	  a	  coordinator	  •  Manages	  wri2ng,	  read	  repair.	  •  Success	  depends	...
Coordinators	  
Reliability	  •  No	  Single	  Points	  of	  Failure	  •  Mul2ple	  Datacenters	  •  Monitorable	      –  JMX	  (or	  what...
Expecta2on	  of	  Failure	  •  C*	  is	  designed	  to	  fail	  •  No	  “Clean	  Shutdown”	  •  kill	  -­‐9,	  it’s	  ok.	  
Failure	  
Failure	  
Failure	  
Hinted	  Handoff	  
Decommission	  (RF3)	  
Repair	  
Keyspaces	  and	  ColumnFamilies	  •  Loosely	  analogous	  to	  “Schemas”	  and	  “Tables”	  
Inside	  CFs,	  columns	  are	  dynamic	  l    Twi=er:	  “Fiveen	  months	  ago,	  it	  took	  two	        weeks	  to	  p...
ColumnFamilies	  l    Sta2c	        l    Object	  data	  l    Dynamic	        l    Precalculated	  query	  results	  
“sta2c”	  columnfamilies	                                  Userszznate	          Password:	  *	       Name:	  Nate	   driv...
“dynamic”	  columnfamilies	                                   Followingzznate	         drivx:	      thobbs:	   drivx	  tho...
Inser2ng	  l    Really	  “insert	  or	  update”	  l    Not	  a	  key/value	  store	  –	  update	  as	  much	  of	  the	 ...
Column	  indexes	  l    Name	  vs	  range	  filters	  l    “reversed=true”	        l    Special	  case:	  forward-­‐scan...
Example:	  Twissandra	  •  h=p://twissandra.com	  
Tweets	  RowKey: 92dbeb50-ed45-11df-a6d0-000c29864c4f=> (column=body, value=Four score and seven years ago,timestamp=12894...
Userline	  RowKey: ericflo=> (column=1289446393708810, value=6a0b4834-ed44-11df-bc31-000c29864c4f, timestamp=1289446393710...
Userline	                   1289847840615:	  3f19757a-­‐zznate	                                                 1289847887...
Timeline	  RowKey: ericflo=> (column=1289446393708810, value=6a0b4834-ed44-11df-bc31-000c29864c4f, timestamp=1289446393710...
Adding	  a	  tweet	  tweet_id = str(uuid())body = @ericflo thanks for Twissandra, it helps!timestamp = long(time.time() * ...
Reads	  timeline = USERLINE.get(uname, column_reversed=True)tweets = TWEET.multiget(timeline.values())start = request.GET....
I	  can	  has	  smarter	  clients?	  l    Dont	  use	  thriv	  directly	  l    Higher	  level	  clients	  have	  a	  lot...
Raw	  thriv	  API:	  Connec2ng	  def get_client(host=127.0.0.1, port=9170):    socket = TSocket.TSocket(host, port)    tra...
Raw	  thriv	  API:	  Inser2ng	  data = {id: useruuid, ...}columns = [Column(k, v, time.time())           for (k, v) in dat...
Raw	  thriv	  API:	  Fetching	  l     get,	  get_slice,	  get_count,	  mul2get_slice,	         get_range_slices	  l     ...
API	  layers	  Layer	             Analog	  libpq	             Thriv	  JDBC	              Hector	  JPA	               Kunde...
Language	  support	  l    Python	         l    pycassa	         l    telephus	  l    Ruby	         l    Speed	  is	  ...
Done	  yet?	  l    S2ll	  doing	  1+N	  queries	  per	  page	  l    Solu2on:	  Supercolumns	  l    Err..	  Well	  maybe...
Supercolumns:	  limita2ons	  l     Requires	  reading	  an	  en2re	  SC	  (not	  the	  en2re	         row)	  from	  disk	...
UUIDs	  l    Column	  names	  should	  be	  uuids,	  not	  longs,	  to	        avoid	  collisions	  l    Version	  1	  U...
0.7: secondary indexes  Obviate need for Userline (but not Timeline)l 
Lucandra	  l    What	  documents	  contain	  term	  X?	        l    …	  and	  term	  Y?	        l    …	  or	  start	  w...
FAQ:	  coun2ng	  l    UUIDs	  +	  batch	  process	  l    Mutex	  (contrib/mutex	  or	  “cages”)	  l    Use	  redis	  or...
Tips	  l    Insert	  instead	  of	  check-­‐then-­‐insert	  l    Use	  client-­‐side	  clock	  to	  your	  advantage	  l...
Ben Coverston - The Apache Cassandra Project
Ben Coverston - The Apache Cassandra Project
Upcoming SlideShare
Loading in …5
×

Ben Coverston - The Apache Cassandra Project

2,522 views

Published on

Abstract:
Cassandra is a new kind of database: it is more than a single-machine system. It naturally runs in a High-Availability configuration. All nodes in the system are symmetric; there is no single point of failure. As you add machines, failure becomes routine, and Cassandra is built to tolerate that with no interruptions.

Cassandra is linearly scalable with good performance characteristics for very small and very large data stores. Unlike earlier efforts, Cassandra is more than just a key-value store; it is a structured data store which can facilitate complex use cases and queries. Cassandra allows for random access to your data organized into rows and columns.

Cassandra is different, and exciting. This presentation will discuss the pros and cons of using Cassandra, and why it has seen such amazing adoption in the past year.

Bio:
Ben Coverston is Director of Operations at DataStax (formerly knows as Riptano), a provider of software, support, services, training, resources and help for Cassandra. He has been involved in enterprise software his entire career. Working in the airline industry, he helped to build some of the highest volume online booking sites in the world. He saw first hand the consequences of trying to solve real world scalability problems at the limit of what traditional relational databases are capable of.

0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
2,522
On SlideShare
0
From Embeds
0
Number of Embeds
5
Actions
Shares
0
Downloads
68
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide

Ben Coverston - The Apache Cassandra Project

  1. 1. Ben  Coverston   Director  of  Opera2ons  ben.coverston@datastax.com   Hosted  By:   Ma=hew  O’Keefe   MorningStar      
  2. 2. History  •  Open  Sourced  by  FB  in  July  2008  •  Apache  Incubator  March  2009  •  Graduated  March  2010  •  Riptano  Founded  April  2010  •  First  Summit  August  2010  •  Riptano  Changed  to  Datastax  January  2011  
  3. 3. You  Changed  Your  Name?  Why!?  •  Suits   –  Marke2ng   –  Relevancy   –  Riptano  too  “Skateboard”  •  The  Real  Reason?   –  “The  X  makes  it  sound  cool.”  –  Bender  Bending   Rodriguez,  Futurama  
  4. 4. Strengths  •  Scalable  •  Reliable   –  Replica2on  that  works   –  Mul2-­‐DC  Support   –  No  Single  Point  of  Failure  •  Analy2cs  in  the  same  system  as  OLTP  (with   “integrated”  Hadoop  support)  
  5. 5. Weaknesses  •  No  ACID  Transac2ons  •  Limited  Support  for  (OLTP)  ad-­‐hoc  queries  •  ..but  you  lost  that  when  you  started  to  shard   your  rela2onal  system.    
  6. 6. A  Short  History  of  Big  Data  (Or  Why   Cassandra)  •  Rela2onal  databases  scale  poorly  •  B-­‐trees  are  slow   –  ..and  require  read  before  write.   –  ..hope  your  dataset  fits  in  memory  
  7. 7. First  A=empt  
  8. 8. We  just  need  to  buy  a  bigger  box…  
  9. 9. We  Just  Need  to  Cache  Our  Data…  
  10. 10. Add  a  few  more  Databases  
  11. 11. A  Li=le  Sharding  
  12. 12. What  About  Backup?  
  13. 13. Add  Another  Layer  of  Abstrac2on  
  14. 14. What  do  we  end  up  with?  (“The  eBay  Architecture,”  Randy  Shoup  and  Dan  Pritche=)  
  15. 15. BASE  •  BASE  is  diametrically  opposed  to  ACID.  Where   ACID  is  pessimis2c  and  forces  consistency  at   the  end  of  every  opera2on,  BASE  is  op2mis2c   and  accepts  that  the  database  consistency  will   be  in  a  state  of  flux.  Although  this  sounds   impossible  to  cope  with,  in  reality  it  is  quite   manageable  and  leads  to  levels  of  scalability   that  cannot  be  obtained  with  ACID.   –  Dan  Pritche=  –  NoSQL  Pioneer,  Ebay  Engineer   h=p://queue.acm.org/detail.cfm?id=1394128  
  16. 16. Myth  •  Lack  of  ACID  means  that  I  have  to  give  up   transac2onal  guarantees  and  consistency.  •  Paraphrasing:  At  Nellix  we  tend  to  be   op2mis2c.  When  things  don’t  quite  work  out   we  try  again.   –  Siddharth  Andand  •  Achievable  
  17. 17. Cassandra  In  Produc2on  •  Nellix  :  Streaming  Bookmarks  •  Digital  Reasoning:  NLP  &  En2ty  Analy2cs  •  OpenX:  largest  publisher-­‐side  ad  network  •  Cloudkick:  performance  data  &  aggrega2on  •  SimpleGeo:  loca2on-­‐as-­‐API  •  Ooyala:  video  analy2cs  and  business  intelligence  •  ngmoco:  massively  mul2player  online  game  worlds  •  Kosmix:  social  media  aggrega2on  •  Reddit:  vote  tracking  system  •  Twi=er:  Rainbird,  geo  data,  analy2cs  •  …  lots  more  
  18. 18. Who  is  inves2ng  in  Cassandra?  •  DataStax  •  Twi=er:   –  Were  inves2ng  in  Cassandra  every  day.  Itll  be   with  us  for  a  long  2me  and  our  usage  of  it  will  only   grow.    •  Rackspace  •  >  100  different  individuals  have  submi=ed   patches  to  C*  •  You?  
  19. 19. Durability  •  Write  to  Commit  Log   –  fsync  is  cheap  (append  only)   –  Latency  is  only  subject  to  rota2onal  latency   •  Separate  par22on  (no  seeking)   •  SSD  won’t  hurt,  but  it  may  not  help  either.  •  Write  to  memtable  •  Flush  memtable  to  SSTable  
  20. 20. Log  Structured  Storage  
  21. 21. Tuneable  Consistency  •  One,  Quorum,  All  •  R  +  W  >  N  •  Choose  availability  vs  consistency  (latency)  
  22. 22. The  Ring  
  23. 23. Adding  A  Node  
  24. 24. Adding  A  Node  (Con2nued)  
  25. 25. Bootstrapping  
  26. 26. Consistent  Hashing  •  Hash  Func2on  -­‐-­‐  K  à  T   –  Let’s  call  this  |k|  (hash  of  k)  for  our  examples  •  Par22oner  Determines  Loca2on  in  the  Ring    
  27. 27. Par22oning  
  28. 28. Replica2on  •  Simple  Replica2on  Strategy  •  Network  Topology  Strategy   –  How  many  replicas  in  each  datacenter  for  each   keyspace?   –  Generaliza2on  of  Rack  Aware  Strategy  
  29. 29. Replica2on  
  30. 30. Coordinators  •  Each  Node  can  be  a  coordinator  •  Manages  wri2ng,  read  repair.  •  Success  depends  on  per-­‐call  CL  request  
  31. 31. Coordinators  
  32. 32. Reliability  •  No  Single  Points  of  Failure  •  Mul2ple  Datacenters  •  Monitorable   –  JMX  (or  whatever  plugs  into  it  –  lots  of  counters)   –  Cac2   –  Munin   –  Nagios  
  33. 33. Expecta2on  of  Failure  •  C*  is  designed  to  fail  •  No  “Clean  Shutdown”  •  kill  -­‐9,  it’s  ok.  
  34. 34. Failure  
  35. 35. Failure  
  36. 36. Failure  
  37. 37. Hinted  Handoff  
  38. 38. Decommission  (RF3)  
  39. 39. Repair  
  40. 40. Keyspaces  and  ColumnFamilies  •  Loosely  analogous  to  “Schemas”  and  “Tables”  
  41. 41. Inside  CFs,  columns  are  dynamic  l  Twi=er:  “Fiveen  months  ago,  it  took  two   weeks  to  perform  ALTER  TABLE  on  the   statuses  [tweets]  table.”  
  42. 42. ColumnFamilies  l  Sta2c   l  Object  data  l  Dynamic   l  Precalculated  query  results  
  43. 43. “sta2c”  columnfamilies   Userszznate   Password:  *   Name:  Nate   drivx   Password:  *   Name:  Brandon  thobbs   Password:  *   Name:  Tyler  jbellis   Password:  *   Name:  Jonathan   Site:  riptano.com  
  44. 44. “dynamic”  columnfamilies   Followingzznate   drivx:   thobbs:   drivx  thobbs   zznate:   pcmanusjbellis   drivx:   mdennis:   thobbs:   xedin:   zznate:   :  
  45. 45. Inser2ng  l  Really  “insert  or  update”  l  Not  a  key/value  store  –  update  as  much  of  the   row  as  you  want  
  46. 46. Column  indexes  l  Name  vs  range  filters  l  “reversed=true”   l  Special  case:  forward-­‐scan  star2ng  with  beginning   of  row  is  fastest  
  47. 47. Example:  Twissandra  •  h=p://twissandra.com  
  48. 48. Tweets  RowKey: 92dbeb50-ed45-11df-a6d0-000c29864c4f=> (column=body, value=Four score and seven years ago,timestamp=1289446891681799)=> (column=username, value=alincoln,timestamp=1289446891681799)-------------------RowKey: d418a66e-edc5-11df-ae6c-000c29864c4f=> (column=body, value=Do geese see God?,timestamp=1289501976713199)=> (column=username, value=pdrome,timestamp=1289501976713199)
  49. 49. Userline  RowKey: ericflo=> (column=1289446393708810, value=6a0b4834-ed44-11df-bc31-000c29864c4f, timestamp=1289446393710212)=> (column=1289446397693831, value=6c6b5916-ed44-11df-bc31-000c29864c4f, timestamp=1289446397694646)=> (column=1289446891681780, value=92dbeb50-ed45-11df-a6d0-000c29864c4f, timestamp=1289446891685065)=> (column=1289446897315887, value=96379f92-ed45-11df-a6d0-000c29864c4f, timestamp=1289446897317676)
  50. 50. Userline   1289847840615:  3f19757a-­‐zznate   1289847887086:  a20fcf52-­‐595c...   c89d...   drivx  thobbs   1289847887086:  a20fcf52-­‐595c...   1289847840615:  3f19757a-­‐ 1289847844275:  844e75e2-­‐jbellis   c89d...   b546...  
  51. 51. Timeline  RowKey: ericflo=> (column=1289446393708810, value=6a0b4834-ed44-11df-bc31-000c29864c4f, timestamp=1289446393710212)=> (column=1289446397693831, value=6c6b5916-ed44-11df-bc31-000c29864c4f, timestamp=1289446397694646)=> (column=1289446891681780, value=92dbeb50-ed45-11df-a6d0-000c29864c4f, timestamp=1289446891685065)=> (column=1289446897315887, value=96379f92-ed45-11df-a6d0-000c29864c4f, timestamp=1289446897317676)
  52. 52. Adding  a  tweet  tweet_id = str(uuid())body = @ericflo thanks for Twissandra, it helps!timestamp = long(time.time() * 1e6)columns = {uname: useruuid, body: body}TWEET.insert(tweet_id, columns)columns = {ts: tweet_id}USERLINE.insert(uname, columns)TIMELINE.insert(uname, columns)for follower_uname in FOLLOWERS.get(uname, 5000): TIMELINE.insert(follower_uname, columns)
  53. 53. Reads  timeline = USERLINE.get(uname, column_reversed=True)tweets = TWEET.multiget(timeline.values())start = request.GET.get(start)limit = NUM_PER_PAGEtimeline = TIMELINE.get(uname, column_start=start,column_count=limit, column_reversed=True)tweets = TWEET.multiget(timeline.values())
  54. 54. I  can  has  smarter  clients?  l  Dont  use  thriv  directly  l  Higher  level  clients  have  a  lot  of  features  you   want   l  Knowledge  about  data  types   l  Connec2on  pooling   l  Automa2c  retries   l  Logging  
  55. 55. Raw  thriv  API:  Connec2ng  def get_client(host=127.0.0.1, port=9170): socket = TSocket.TSocket(host, port) transport = TTransport.TBufferedTransport(socket) transport.open() protocol =TBinaryProtocol.TBinaryProtocolAccelerated(transport) client = Cassandra.Client(protocol) return client
  56. 56. Raw  thriv  API:  Inser2ng  data = {id: useruuid, ...}columns = [Column(k, v, time.time()) for (k, v) in data.items()]mutations = [Mutation(ColumnOrSuperColumn(column=c)) for c in columns]rows = {useruuid: {User: mutations}}client.batch_mutate(Twissandra, rows,ConsistencyLevel.ONE)
  57. 57. Raw  thriv  API:  Fetching  l  get,  get_slice,  get_count,  mul2get_slice,   get_range_slices  l  ColumnOrSuperColumn  l  h=p://wiki.apache.org/cassandra/API    
  58. 58. API  layers  Layer   Analog  libpq   Thriv  JDBC   Hector  JPA   Kundera  
  59. 59. Language  support  l  Python   l  pycassa   l  telephus  l  Ruby   l  Speed  is  a  nega2ve  l  Java   l  Hector  l  PHP  (soon  with  less  suckage!)  
  60. 60. Done  yet?  l  S2ll  doing  1+N  queries  per  page  l  Solu2on:  Supercolumns  l  Err..  Well  maybe…  
  61. 61. Supercolumns:  limita2ons  l  Requires  reading  an  en2re  SC  (not  the  en2re   row)  from  disk  even  if  you  just  want  one   subcolumn  l  No  Secondary  Indexes  l  It’s  just  an  extra  map  layer.  l  Probably  best  to  avoid  them  if  you  can.    
  62. 62. UUIDs  l  Column  names  should  be  uuids,  not  longs,  to   avoid  collisions  l  Version  1  UUIDs  can  be  sorted  by  2me   (“TimeUUID”)  l  Any  UUID  can  be  sorted  by  its  raw  bytes   (“LexicalUUID”)   l  Usually  Version  4   l  Slightly  less  overhead  
  63. 63. 0.7: secondary indexes Obviate need for Userline (but not Timeline)l 
  64. 64. Lucandra  l  What  documents  contain  term  X?   l  …  and  term  Y?   l  …  or  start  with  Z?  
  65. 65. FAQ:  coun2ng  l  UUIDs  +  batch  process  l  Mutex  (contrib/mutex  or  “cages”)  l  Use  redis  or  mysql  or  memcached  l  column-­‐per-­‐app-­‐server  l  counter  API  (aver  .7  is  out)  
  66. 66. Tips  l  Insert  instead  of  check-­‐then-­‐insert  l  Use  client-­‐side  clock  to  your  advantage  l  use  TTL  l  Wider  rows  (but  not  too  wide)    l  Start  with  queries,  work  backwards  l  Avoid  storing  extra  “2mestamp”  columns  

×