Ben Coverston - The Apache Cassandra Project

Abstract:
Cassandra is a new kind of database: it is more than a single-machine system. It naturally runs in a High-Availability configuration. All nodes in the system are symmetric; there is no single point of failure. As you add machines, failure becomes routine, and Cassandra is built to tolerate that with no interruptions.

Cassandra is linearly scalable with good performance characteristics for very small and very large data stores. Unlike earlier efforts, Cassandra is more than just a key-value store; it is a structured data store which can facilitate complex use cases and queries. Cassandra allows for random access to your data organized into rows and columns.

Cassandra is different, and exciting. This presentation will discuss the pros and cons of using Cassandra, and why it has seen such amazing adoption in the past year.

Bio:
Ben Coverston is Director of Operations at DataStax (formerly known as Riptano), a provider of software, support, services, training, resources and help for Cassandra. He has been involved in enterprise software his entire career. Working in the airline industry, he helped to build some of the highest volume online booking sites in the world. He saw first hand the consequences of trying to solve real world scalability problems at the limit of what traditional relational databases are capable of.

Transcript

  • 1. Ben Coverston, Director of Operations, ben.coverston@datastax.com. Hosted By: Matthew O’Keefe, MorningStar
  • 2. History • Open Sourced by FB in July 2008 • Apache Incubator March 2009 • Graduated March 2010 • Riptano Founded April 2010 • First Summit August 2010 • Riptano Changed to DataStax January 2011
  • 3. You Changed Your Name? Why!? • Suits – Marketing – Relevancy – Riptano too “Skateboard” • The Real Reason? – “The X makes it sound cool.” – Bender Bending Rodriguez, Futurama
  • 4. Strengths • Scalable • Reliable – Replication that works – Multi-DC Support – No Single Point of Failure • Analytics in the same system as OLTP (with “integrated” Hadoop support)
  • 5. Weaknesses • No ACID Transactions • Limited Support for (OLTP) ad-hoc queries • ...but you lost that when you started to shard your relational system.
  • 6. A Short History of Big Data (Or Why Cassandra) • Relational databases scale poorly • B-trees are slow – ...and require read before write. – ...hope your dataset fits in memory
  • 7. First Attempt
  • 8. We just need to buy a bigger box…
  • 9. We Just Need to Cache Our Data…
  • 10. Add a few more Databases
  • 11. A Little Sharding
  • 12. What About Backup?
  • 13. Add Another Layer of Abstraction
  • 14. What do we end up with? (“The eBay Architecture,” Randy Shoup and Dan Pritchett)
  • 15. BASE • BASE is diametrically opposed to ACID. Where ACID is pessimistic and forces consistency at the end of every operation, BASE is optimistic and accepts that the database consistency will be in a state of flux. Although this sounds impossible to cope with, in reality it is quite manageable and leads to levels of scalability that cannot be obtained with ACID. – Dan Pritchett – NoSQL Pioneer, eBay Engineer http://queue.acm.org/detail.cfm?id=1394128
  • 16. Myth • Lack of ACID means that I have to give up transactional guarantees and consistency. • Paraphrasing: At Netflix we tend to be optimistic. When things don’t quite work out we try again. – Siddharth Anand • Achievable
  • 17. Cassandra In Production • Netflix: Streaming Bookmarks • Digital Reasoning: NLP & Entity Analytics • OpenX: largest publisher-side ad network • Cloudkick: performance data & aggregation • SimpleGeo: location-as-API • Ooyala: video analytics and business intelligence • ngmoco: massively multiplayer online game worlds • Kosmix: social media aggregation • Reddit: vote tracking system • Twitter: Rainbird, geo data, analytics • … lots more
  • 18. Who is investing in Cassandra? • DataStax • Twitter: – We’re investing in Cassandra every day. It’ll be with us for a long time and our usage of it will only grow. • Rackspace • > 100 different individuals have submitted patches to C* • You?
  • 19. Durability • Write to Commit Log – fsync is cheap (append only) – Latency is only subject to rotational latency • Separate partition (no seeking) • SSD won’t hurt, but it may not help either. • Write to memtable • Flush memtable to SSTable
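The write path on slide 19 (commit log, then memtable, then SSTable) can be sketched as a toy in-memory model. This is only an illustration, not Cassandra's implementation: `ToyStore`, `flush_threshold` and the plain Python containers are assumptions made for the example, and real commit-log recycling is more involved than clearing a list.

```python
class ToyStore:
    """Toy model of the write path: commit log -> memtable -> SSTable."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only; stands in for the on-disk log
        self.memtable = {}     # in-memory and mutable
        self.sstables = []     # immutable, sorted snapshots (oldest first)
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. sequential append first
        self.memtable[key] = value            # 2. then update memory
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. write the memtable out as a sorted, immutable table
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log = []  # simplification: flushed entries need no replay

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable first
            for k, v in table:
                if k == key:
                    return v
        return None
```

Note that a write never reads existing data first, which is the contrast with the B-tree "read before write" point on slide 6.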
  • 20. Log Structured Storage
  • 21. Tuneable Consistency • One, Quorum, All • R + W > N • Choose availability vs consistency (latency)
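The rule R + W > N on slide 21 can be checked with a few lines of arithmetic, taking R as the number of replicas a read waits for, W as the number a write waits for, and N as the replication factor. The function names below are hypothetical helpers for this sketch, not part of any Cassandra client.

```python
def replicas_required(level, n):
    """Number of replicas that must respond for a consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return n // 2 + 1  # a majority of the n replicas
    if level == "ALL":
        return n
    raise ValueError("unknown consistency level: %r" % level)


def read_sees_latest_write(read_level, write_level, n):
    """True when R + W > N, so every read set overlaps every write set."""
    r = replicas_required(read_level, n)
    w = replicas_required(write_level, n)
    return r + w > n
```

With N = 3, QUORUM reads and QUORUM writes overlap (2 + 2 > 3), while ONE and ONE do not (1 + 1 ≤ 3); the latter trades consistency for lower latency and higher availability.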
  • 22. The Ring
  • 23. Adding A Node
  • 24. Adding A Node (Continued)
  • 25. Bootstrapping
  • 26. Consistent Hashing • Hash Function: K → T – Let’s call this |k| (hash of k) for our examples • Partitioner Determines Location in the Ring
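A minimal sketch of the idea on slide 26, assuming MD5 as the hash function K → T and hand-picked node tokens. The `Ring` class and the node names are made up for illustration; a key belongs to the first node whose token is greater than or equal to the key's hash, wrapping around at the end of the ring.

```python
import bisect
import hashlib


def token(key):
    """Hash a key to a position on the ring (|k| in the slide's notation)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        # nodes: {node_name: token}, sorted here into ring order
        self._ring = sorted((t, name) for name, t in nodes.items())
        self._tokens = [t for t, _ in self._ring]

    def owner(self, key):
        # First node token >= hash of key; wrap to the first node if none
        i = bisect.bisect_left(self._tokens, token(key))
        return self._ring[i % len(self._ring)][1]
```

Adding a node with a new token only takes over part of the range of its successor on the ring, so every other key stays where it was.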
  • 27. Partitioning
  • 28. Replication • Simple Replication Strategy • Network Topology Strategy – How many replicas in each datacenter for each keyspace? – Generalization of Rack Aware Strategy
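The simple replication strategy on slide 28 can be sketched as "the owning node plus the next nodes clockwise on the ring". The function below is a hypothetical helper working on bare token lists; it ignores racks and datacenters, which is what the network topology strategy adds.

```python
import bisect


def replicas(sorted_tokens, key_token, rf):
    """Indices of the nodes holding key_token.

    The first replica is the node owning the token; the remaining
    rf - 1 replicas are the next nodes clockwise on the ring.
    """
    n = len(sorted_tokens)
    first = bisect.bisect_left(sorted_tokens, key_token) % n
    return [(first + i) % n for i in range(min(rf, n))]
```

For example, on a four-node ring with tokens 0, 25, 50 and 75, a key hashing to 30 with a replication factor of 3 lands on the nodes at tokens 50, 75 and 0.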
  • 29. Replication
  • 30. Coordinators • Each Node can be a coordinator • Manages writing, read repair. • Success depends on per-call CL request
  • 31. Coordinators
  • 32. Reliability • No Single Points of Failure • Multiple Datacenters • Monitorable – JMX (or whatever plugs into it – lots of counters) – Cacti – Munin – Nagios
  • 33. Expectation of Failure • C* is designed to fail • No “Clean Shutdown” • kill -9, it’s ok.
  • 34. Failure
  • 35. Failure
  • 36. Failure
  • 37. Hinted Handoff
  • 38. Decommission (RF3)
  • 39. Repair
  • 40. Keyspaces and ColumnFamilies • Loosely analogous to “Schemas” and “Tables”
  • 41. Inside CFs, columns are dynamic • Twitter: “Fifteen months ago, it took two weeks to perform ALTER TABLE on the statuses [tweets] table.”
  • 42. ColumnFamilies • Static – Object data • Dynamic – Precalculated query results
  • 43. “static” columnfamilies: Users
      zznate    Password: *    Name: Nate
      driftx    Password: *    Name: Brandon
      thobbs    Password: *    Name: Tyler
      jbellis   Password: *    Name: Jonathan    Site: riptano.com
  • 44. “dynamic” columnfamilies: Following [diagram: one row per user (zznate, driftx, thobbs, jbellis); the column names in each row are the users that row follows (driftx, thobbs, zznate, pcmanus, mdennis, xedin), with empty values]
  • 45. Inserting • Really “insert or update” • Not a key/value store – update as much of the row as you want
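The "insert or update" behaviour on slide 45 can be modelled with nested dictionaries: an insert touches only the columns it names, and the write with the newer timestamp wins per column. The `insert` helper and the store layout are assumptions for this sketch, and resolving timestamp ties in favour of the later call is a simplification.

```python
def insert(store, key, columns, timestamp):
    """Upsert columns into a row; untouched columns keep their values."""
    row = store.setdefault(key, {})  # create the row if it does not exist
    for name, value in columns.items():
        current = row.get(name)
        # Per-column reconciliation: keep whichever write is newer
        if current is None or timestamp >= current[1]:
            row[name] = (value, timestamp)
```

Because there is no separate "does this row exist?" step, the same call serves as both insert and update, and a stale write arriving late is simply ignored for the columns it lost.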
  • 46. Column indexes • Name vs range filters • “reversed=true” – Special case: forward-scan starting with beginning of row is fastest
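The two filter styles on slide 46 can be illustrated on a row modelled as a dictionary whose columns are kept in sorted name order. The helpers `get_by_names` and `slice_columns` are hypothetical; the convention that `start` is the upper bound when the slice is reversed is an assumption of this sketch.

```python
import bisect


def get_by_names(row, names):
    """Name filter: fetch exactly the listed columns that exist."""
    return {n: row[n] for n in names if n in row}


def slice_columns(row, start=None, finish=None, count=100, reversed_=False):
    """Range filter over columns in sorted name order.

    When reversed_ is true the scan runs from high names to low names,
    so start is the upper bound and finish the lower bound.
    """
    if reversed_:
        start, finish = finish, start
    names = sorted(row)
    lo = 0 if start is None else bisect.bisect_left(names, start)
    hi = len(names) if finish is None else bisect.bisect_right(names, finish)
    selected = names[lo:hi]
    if reversed_:
        selected.reverse()
    return [(n, row[n]) for n in selected[:count]]
```

A reversed slice with a small `count` is the pattern behind "newest N entries" queries when column names are timestamps.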
  • 47. Example: Twissandra • http://twissandra.com
  • 48. Tweets
      RowKey: 92dbeb50-ed45-11df-a6d0-000c29864c4f
      => (column=body, value=Four score and seven years ago, timestamp=1289446891681799)
      => (column=username, value=alincoln, timestamp=1289446891681799)
      -------------------
      RowKey: d418a66e-edc5-11df-ae6c-000c29864c4f
      => (column=body, value=Do geese see God?, timestamp=1289501976713199)
      => (column=username, value=pdrome, timestamp=1289501976713199)
  • 49. Userline
      RowKey: ericflo
      => (column=1289446393708810, value=6a0b4834-ed44-11df-bc31-000c29864c4f, timestamp=1289446393710212)
      => (column=1289446397693831, value=6c6b5916-ed44-11df-bc31-000c29864c4f, timestamp=1289446397694646)
      => (column=1289446891681780, value=92dbeb50-ed45-11df-a6d0-000c29864c4f, timestamp=1289446891685065)
      => (column=1289446897315887, value=96379f92-ed45-11df-a6d0-000c29864c4f, timestamp=1289446897317676)
  • 50. Userline [diagram: one row per user (zznate, driftx, thobbs, jbellis); column names are timestamps (1289847840615, 1289847844275, 1289847887086) and values are tweet ids (3f19757a-c89d..., 844e75e2-b546..., a20fcf52-595c...)]
  • 51. Timeline
      RowKey: ericflo
      => (column=1289446393708810, value=6a0b4834-ed44-11df-bc31-000c29864c4f, timestamp=1289446393710212)
      => (column=1289446397693831, value=6c6b5916-ed44-11df-bc31-000c29864c4f, timestamp=1289446397694646)
      => (column=1289446891681780, value=92dbeb50-ed45-11df-a6d0-000c29864c4f, timestamp=1289446891685065)
      => (column=1289446897315887, value=96379f92-ed45-11df-a6d0-000c29864c4f, timestamp=1289446897317676)
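  • 52. Adding a tweet
      import time
      import uuid

      tweet_id = str(uuid.uuid1())
      body = '@ericflo thanks for Twissandra, it helps!'
      timestamp = int(time.time() * 1e6)  # microseconds since the epoch
      # Store the tweet itself, keyed by its id
      columns = {'uname': useruuid, 'body': body}
      TWEET.insert(tweet_id, columns)
      # Index the tweet by time in the author's userline and timeline
      columns = {timestamp: tweet_id}
      USERLINE.insert(uname, columns)
      TIMELINE.insert(uname, columns)
      # Fan out to the timeline of each follower (up to 5000)
      for follower_uname in FOLLOWERS.get(uname, column_count=5000):
          TIMELINE.insert(follower_uname, columns)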
  • 53. Reads
      # A user's own tweets, newest first
      timeline = USERLINE.get(uname, column_reversed=True)
      tweets = TWEET.multiget(timeline.values())

      # One page of the user's timeline, newest first
      start = request.GET.get('start')
      limit = NUM_PER_PAGE
      timeline = TIMELINE.get(uname, column_start=start,
                              column_count=limit, column_reversed=True)
      tweets = TWEET.multiget(timeline.values())
  • 54. I can has smarter clients? • Don’t use Thrift directly • Higher level clients have a lot of features you want – Knowledge about data types – Connection pooling – Automatic retries – Logging
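  • 55. Raw Thrift API: Connecting
      from thrift.transport import TSocket, TTransport
      from thrift.protocol import TBinaryProtocol
      from cassandra import Cassandra  # Thrift-generated bindings

      def get_client(host='127.0.0.1', port=9170):
          socket = TSocket.TSocket(host, port)
          transport = TTransport.TBufferedTransport(socket)
          transport.open()
          protocol = TBinaryProtocol.TBinaryProtocolAccelerated(transport)
          client = Cassandra.Client(protocol)
          return client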
  • 56. Raw Thrift API: Inserting
      data = {'id': useruuid, ...}
      # Column timestamps are integers; use microseconds as on slide 52
      now = int(time.time() * 1e6)
      columns = [Column(k, v, now) for (k, v) in data.items()]
      mutations = [Mutation(ColumnOrSuperColumn(column=c)) for c in columns]
      rows = {useruuid: {'User': mutations}}
      client.batch_mutate('Twissandra', rows, ConsistencyLevel.ONE)
  • 57. Raw Thrift API: Fetching • get, get_slice, get_count, multiget_slice, get_range_slices • ColumnOrSuperColumn • http://wiki.apache.org/cassandra/API
  • 58. API layers
      Layer    Analog
      libpq    Thrift
      JDBC     Hector
      JPA      Kundera
  • 59. Language support • Python – pycassa – telephus • Ruby – Speed is a negative • Java – Hector • PHP (soon with less suckage!)
  • 60. Done yet? • Still doing 1+N queries per page • Solution: Supercolumns • Err.. Well maybe…
  • 61. Supercolumns: limitations • Requires reading an entire SC (not the entire row) from disk even if you just want one subcolumn • No Secondary Indexes • It’s just an extra map layer. • Probably best to avoid them if you can.
  • 62. UUIDs • Column names should be UUIDs, not longs, to avoid collisions • Version 1 UUIDs can be sorted by time (“TimeUUID”) • Any UUID can be sorted by its raw bytes (“LexicalUUID”) – Usually Version 4 – Slightly less overhead
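The difference between the two orderings on slide 62 can be seen with Python's standard `uuid` module. A version 1 UUID stores the low bits of its timestamp first, so sorting by raw bytes does not give time order. The helper `uuid1_at` is a hypothetical constructor that builds a version 1 UUID from a chosen timestamp (in 100-nanosecond units) so the example is deterministic; real code would normally call `uuid.uuid1()`.

```python
import uuid


def uuid1_at(timestamp, clock_seq=0, node=0):
    """Build a version 1 UUID for a given 60-bit timestamp."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | (1 << 12)  # version 1
    clock_seq_low = clock_seq & 0xFF
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80  # RFC 4122 variant
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


def time_sorted(ids):
    """Order by embedded timestamp, as a time-based comparator would."""
    return sorted(ids, key=lambda u: u.time)


def lexical_sorted(ids):
    """Order by raw bytes, as a byte-wise comparator would."""
    return sorted(ids, key=lambda u: u.bytes)
```

Two UUIDs whose timestamps straddle a 32-bit boundary sort in opposite orders under the two functions, which is why time-ordered column names need the time-based comparison.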
  • 63. 0.7: secondary indexes • Obviate need for Userline (but not Timeline)
  • 64. Lucandra • What documents contain term X? – … and term Y? – … or start with Z?
  • 65. FAQ: counting • UUIDs + batch process • Mutex (contrib/mutex or “cages”) • Use redis or mysql or memcached • column-per-app-server • counter API (after 0.7 is out)
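One reading of the "column-per-app-server" option on slide 65, sketched in memory: each application server keeps its own running count and writes it to a column named after itself, and a reader sums the columns. The `ShardedCounter` class and the server ids are assumptions for illustration; the slide does not spell out the scheme.

```python
class ShardedCounter:
    """Counter row with one column per application server."""

    def __init__(self):
        self.row = {}  # column name = server id, value = that server's count

    def increment(self, server_id, amount=1):
        # A server only ever overwrites its own column, so two servers
        # never race on the same read-modify-write.
        self.row[server_id] = self.row.get(server_id, 0) + amount

    def total(self):
        # Readers fetch the whole row and add the columns up
        return sum(self.row.values())
```

The row grows with the number of app servers rather than the number of increments, so it stays small enough to read in a single slice.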
  • 66. Tips • Insert instead of check-then-insert • Use client-side clock to your advantage • Use TTL • Wider rows (but not too wide) • Start with queries, work backwards • Avoid storing extra “timestamp” columns