Apache Cassandra in Bangalore - Cassandra Internals and Performance

Slides from http://www.meetup.com/Apache-Cassandra/events/108524582/

Published in: Technology

Transcript of "Apache Cassandra in Bangalore - Cassandra Internals and Performance"

  1. BANGALORE CASSANDRA UG, APRIL 2013: CASSANDRA INTERNALS & PERFORMANCE. Aaron Morton, @aaronmorton, www.thelastpickle.com. Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License.
  2. Architecture, Code
  3. Cassandra Architecture: Clients, APIs, Cluster Aware, Cluster Unaware, Disk
  4. Cassandra Cluster Architecture: Clients. Node 1: APIs, Cluster Aware, Cluster Unaware, Disk. Node 2: APIs, Cluster Aware, Cluster Unaware, Disk.
  5. Dynamo Cluster Architecture: Clients. Node 1: APIs, Dynamo, Database, Disk. Node 2: APIs, Dynamo, Database, Disk.
  6. Architecture: API, Dynamo, Database
  7. API Transports: Thrift, Native Binary, Read Line, RMI
  8. Thrift Transport: // Custom TServer implementations; o.a.c.thrift.CustomTThreadPoolServer; o.a.c.thrift.CustomTNonBlockingServer; o.a.c.thrift.CustomTHsHaServer
  9. API Transports: Thrift, Native Binary, Read Line, RMI
  10. Native Binary Transport: Beta in Cassandra 1.2; uses Netty 3.5; enabled with start_native_transport (disabled by default)
  11. o.a.c.transport.Server.run(): // Setup the Netty server; new ExecutionHandler(); new NioServerSocketChannelFactory(); ServerBootstrap.setPipelineFactory()
  12. o.a.c.transport.Message.Dispatcher.messageReceived(): // Process message from client; ServerConnection.validateNewMessage(); Request.execute(); ServerConnection.applyStateTransition(); Channel.write()
  13. o.a.c.transport.messages: CredentialsMessage(); EventMessage(); ExecuteMessage(); PrepareMessage(); QueryMessage(); ResultMessage(); (And more...)
  14. Messages: Defined in the Native Binary Protocol, $SRC/doc/native_protocol.spec
  15. API Services: JMX, CLI, Thrift, CQL 3
  16. JMX Management Beans: Spread around the code base. Interfaces named *MBean.
  17. JMX Management Beans: Registered with names such as org.apache.cassandra.db:type=StorageProxy
  18. API Services: JMX, CLI, Thrift, CQL 3
  19. o.a.c.cli.CliMain.main(): // Connect to server to read input; this.connect(); this.evaluateFileStatements(); this.processStatementInteractive()
  20. CLI Grammar: ANTLR Grammar, $SRC/src/java/o/a/c/cli/CLI.g
  21. o.a.c.cli.CliClient.executeCLIStatement(): // Process statement; CliCompiler.compileQuery() (ANTLR); switch (tree.getType()) case...
  22. API Services: JMX, CLI, Thrift, CQL 3
  23. o.a.c.thrift.CassandraServer: // Implements Thrift Interface; // Access control; // Input validation; // Mapping to/from Thrift and internal types
  24. Thrift Interface: Thrift IDL, $SRC/interface/cassandra.thrift
  25. o.a.c.thrift.CassandraServer.get_slice(): // Get columns for one row; Tracing.begin(); ClientState cState = state(); cState.hasColumnFamilyAccess(); multigetSliceInternal()
  26. CassandraServer.multigetSliceInternal(): // Get columns for many rows; ThriftValidation.validate*(); // Create ReadCommands; getSlice()
  27. CassandraServer.getSlice(): // Process ReadCommands; // Return Thrift types; readColumnFamily(); thriftifyColumnFamily()
  28. CassandraServer.readColumnFamily(): // Process ReadCommands; // Return ColumnFamilies; StorageProxy.read()
  29. API Services: JMX, CLI, Thrift, CQL 3
  30. o.a.c.cql3.QueryProcessor: // Prepares and executes CQL3 statements; // Used by Thrift & Native transports; // Access control; // Input validation; // Returns transport.ResultMessage
  31. CQL3 Grammar: ANTLR Grammar, $SRC/o.a.c.cql3/Cql.g
  32. o.a.c.cql3.statements.ParsedStatement: // Subclasses generated by ANTLR; // Tracks bound term count; // Prepare CQLStatement; prepare()
  33. o.a.c.cql3.statements.CQLStatement: checkAccess(ClientState state); validate(ClientState state); execute(ConsistencyLevel cl, QueryState state, List<ByteBuffer> variables)
  34. o.a.c.cql3.functions.Function: argsType(); returnType(); execute(List<ByteBuffer> parameters)
  35. statements.SelectStatement.RawStatement: // Implements ParsedStatement; // Input validation; prepare()
  36. statements.SelectStatement.execute(): // Create ReadCommands; StorageProxy.read()
  37. Architecture: API, Dynamo, Database
  38. Dynamo Layer: o.a.c.service, o.a.c.net, o.a.c.dht, o.a.c.locator, o.a.c.gms, o.a.c.stream
  39. o.a.c.service.StorageProxy: // Cluster wide storage operations; // Select endpoints & check CL available; // Send messages to Stages; // Wait for response; // Store Hints
  40. o.a.c.service.StorageService: // Ring operations; // Track ring state; // Start & stop ring membership; // Node & token queries
  41. o.a.c.service.IResponseResolver: preprocess(MessageIn<T> message); resolve() throws DigestMismatchException. RowDigestResolver, RowDataResolver, RangeSliceResponseResolver.
  42. Response Handlers / Callback: implements IAsyncCallback<T>; response(MessageIn<T> msg)
  43. o.a.c.service.ReadCallback.get(): // Wait for blockfor & data; condition.await(timeout, TimeUnit.MILLISECONDS); throw ReadTimeoutException(); resolver.resolve()
  44. o.a.c.service.StorageProxy.fetchRows(): getLiveSortedEndpoints(); new RowDigestResolver(); new ReadCallback(); MessagingService.sendRR(). Then: ReadCallback.get() (blocking); catch (DigestMismatchException ex); catch (ReadTimeoutException ex)
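
The blocking wait described for ReadCallback.get() (wait for blockfor responses, time out otherwise) can be illustrated with a small Java sketch. Everything here is hypothetical: BlockingCallbackSketch and ReadTimeout are stand-ins invented for illustration, not Cassandra classes, and the "resolve" step is reduced to returning the last response.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a read callback; this is not Cassandra's ReadCallback.
class BlockingCallbackSketch {
    // Unchecked stand-in for ReadTimeoutException (assumption, for brevity).
    static class ReadTimeout extends RuntimeException {
        ReadTimeout(String message) { super(message); }
    }

    private final int blockFor;              // responses required before get() may return
    private final CountDownLatch remaining;  // counts down once per response
    private volatile String lastResponse;

    BlockingCallbackSketch(int blockFor) {
        this.blockFor = blockFor;
        this.remaining = new CountDownLatch(blockFor);
    }

    // Called by the messaging layer for each replica response.
    void response(String data) {
        lastResponse = data;
        remaining.countDown();
    }

    // Blocks until blockFor responses have arrived, or throws once the timeout expires.
    String get(long timeoutMillis) {
        try {
            if (!remaining.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                long received = blockFor - remaining.getCount();
                throw new ReadTimeout("received " + received + " of " + blockFor + " responses");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        // A real resolver would compare data and digests here; the sketch returns the last response.
        return lastResponse;
    }

    public static void main(String[] args) {
        BlockingCallbackSketch callback = new BlockingCallbackSketch(2);
        callback.response("row-from-replica-1");
        callback.response("row-from-replica-2");
        System.out.println(callback.get(100));
    }
}
```

The caller thread blocks in get() while responses arrive on other threads, which is why the coordinator's request latency includes the slowest of the blockfor replicas.
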
  45. Dynamo Layer: o.a.c.service, o.a.c.net, o.a.c.dht, o.a.c.locator, o.a.c.gms, o.a.c.stream
  46. o.a.c.net.MessagingService.Verb <<enum>>: MUTATION, READ, REQUEST_RESPONSE, TREE_REQUEST, TREE_RESPONSE (And more...)
  47. o.a.c.net.MessagingService.verbHandlers: new EnumMap<Verb, IVerbHandler>(Verb.class)
  48. o.a.c.net.IVerbHandler<T>: doVerb(MessageIn<T> message, String id);
  49. o.a.c.net.MessagingService.verbStages: new EnumMap<MessagingService.Verb, Stage>(MessagingService.Verb.class)
  50. o.a.c.net.MessagingService.receive(): runnable = new MessageDeliveryTask(message, id, timestamp); StageManager.getStage(message.getMessageType()); stage.execute(runnable);
  51. o.a.c.net.MessageDeliveryTask.run(): // If droppable and older than rpc_timeout: MessagingService.incrementDroppedMessages(verb); otherwise: MessagingService.getVerbHandler(verb); verbHandler.doVerb(message, id)
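
The verb-to-handler and verb-to-stage lookups above can be sketched in Java as follows. This is a simplified illustration under stated assumptions, not the MessagingService code: VerbDispatchSketch, its register() method and the injected clock are hypothetical, messages are plain strings, and every verb is treated as droppable.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongSupplier;

// Hypothetical sketch of routing a message to a stage and then to a verb handler.
class VerbDispatchSketch {
    enum Verb { MUTATION, READ, REQUEST_RESPONSE }
    enum Stage { MUTATION, READ, REQUEST_RESPONSE }

    interface VerbHandler { void doVerb(String message, String id); }

    private final Map<Verb, VerbHandler> verbHandlers = new EnumMap<>(Verb.class);
    private final Map<Verb, Stage> verbStages = new EnumMap<>(Verb.class);
    private final Map<Stage, Executor> stages = new EnumMap<>(Stage.class);
    private final AtomicInteger dropped = new AtomicInteger();
    private final long timeoutMillis;
    private final LongSupplier clock;  // injected so the sketch can be tested deterministically

    VerbDispatchSketch(long timeoutMillis, LongSupplier clock) {
        this.timeoutMillis = timeoutMillis;
        this.clock = clock;
    }

    void register(Verb verb, Stage stage, Executor executor, VerbHandler handler) {
        verbHandlers.put(verb, handler);
        verbStages.put(verb, stage);
        stages.put(stage, executor);
    }

    // Queue the message on the stage mapped to its verb (the verb must be registered).
    void receive(Verb verb, String message, String id, long sentAtMillis) {
        Executor stage = stages.get(verbStages.get(verb));
        stage.execute(() -> {
            // The age check happens when the task runs, not when it is queued.
            if (clock.getAsLong() - sentAtMillis > timeoutMillis) {
                dropped.incrementAndGet();
                return;
            }
            verbHandlers.get(verb).doVerb(message, id);
        });
    }

    int droppedCount() { return dropped.get(); }

    public static void main(String[] args) {
        Executor direct = Runnable::run;  // a real stage would be a thread pool
        VerbDispatchSketch service = new VerbDispatchSketch(10_000L, System::currentTimeMillis);
        service.register(Verb.READ, Stage.READ, direct,
                (message, id) -> System.out.println("READ " + id + ": " + message));
        service.receive(Verb.READ, "row-key-1", "msg-1", System.currentTimeMillis());
    }
}
```

Because the age check runs inside the queued task, a message that waits too long in a busy stage is counted as dropped rather than processed.
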
  52. Dynamo Layer: o.a.c.service, o.a.c.net, o.a.c.dht, o.a.c.locator, o.a.c.gms, o.a.c.stream
  53. o.a.c.dht.IPartitioner<T extends Token>: getToken(ByteBuffer key); getRandomToken(). LocalPartitioner, RandomPartitioner, Murmur3Partitioner.
  54. o.a.c.dht.Token<T>: compareTo(Token<T> o). BytesToken, BigIntegerToken, LongToken.
  55. Dynamo Layer: o.a.c.service, o.a.c.net, o.a.c.dht, o.a.c.locator, o.a.c.gms, o.a.c.stream
  56. o.a.c.locator.IEndpointSnitch: getRack(InetAddress endpoint); getDatacenter(InetAddress endpoint); sortByProximity(InetAddress address, List<InetAddress> addresses). SimpleSnitch, PropertyFileSnitch, Ec2MultiRegionSnitch.
  57. o.a.c.locator.AbstractReplicationStrategy: getNaturalEndpoints(RingPosition searchPosition); calculateNaturalEndpoints(Token searchToken, TokenMetadata tokenMetadata). SimpleStrategy, NetworkTopologyStrategy.
  58. o.a.c.locator.TokenMetadata: BiMultiValMap<Token, InetAddress> tokenToEndpointMap; BiMultiValMap<Token, InetAddress> bootstrapTokens; Set<InetAddress> leavingEndpoints
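
To make the partitioner, token map and replication strategy slides concrete, here is a minimal Java sketch of hashing a key to a token and walking a sorted ring to pick replicas. It is an illustration under assumptions, not the Cassandra implementation: TokenRingSketch, md5Token and naturalEndpoints are hypothetical names, endpoints are plain strings, and the walk is a simple "next distinct nodes clockwise" rule with no rack or datacenter awareness.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch of key -> token -> replica endpoints on a sorted token ring.
class TokenRingSketch {
    // Hash a key to a non-negative BigInteger token using MD5 (assumed hashing scheme).
    static BigInteger md5Token(byte[] key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return new BigInteger(md5.digest(key)).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Walk the ring clockwise from the first token >= the search token, wrapping around,
    // and collect distinct endpoints until the requested number of replicas is reached.
    static List<String> naturalEndpoints(NavigableMap<BigInteger, String> ring,
                                         BigInteger token, int replicas) {
        List<String> walk = new ArrayList<>(ring.tailMap(token, true).values());
        walk.addAll(ring.headMap(token, false).values());
        List<String> result = new ArrayList<>();
        for (String endpoint : walk) {
            if (result.size() == replicas) break;
            if (!result.contains(endpoint)) result.add(endpoint);
        }
        return result;
    }

    public static void main(String[] args) {
        NavigableMap<BigInteger, String> ring = new TreeMap<>();
        BigInteger step = BigInteger.ONE.shiftLeft(125);  // arbitrary spacing for the demo
        ring.put(step, "node1");
        ring.put(step.multiply(BigInteger.valueOf(3)), "node2");
        ring.put(step.multiply(BigInteger.valueOf(6)), "node3");
        BigInteger token = md5Token("user:42".getBytes(StandardCharsets.UTF_8));
        System.out.println(token + " -> " + naturalEndpoints(ring, token, 2));
    }
}
```

A sorted map keyed by token makes the "first token at or after the key's token" lookup a single ceiling operation, and wrapping to the head of the map closes the ring.
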
  59. Dynamo Layer: o.a.c.service, o.a.c.net, o.a.c.dht, o.a.c.locator, o.a.c.gms, o.a.c.stream
  60. o.a.c.gms.VersionedValue: // VersionGenerator.getNextVersion(); public final int version; public final String value;
  61. o.a.c.gms.ApplicationState <<enum>>: STATUS, LOAD, SCHEMA, DC, RACK (And more...)
  62. o.a.c.gms.HeartBeatState: // VersionGenerator.getNextVersion(); private int generation; private int version;
  63. o.a.c.gms.Gossiper.GossipTask.run(): // SYN -> ACK -> ACK2; makeRandomGossipDigest(); new GossipDigestSyn(); // Use MessagingService.sendOneWay(); Gossiper.doGossipToLiveMember(); Gossiper.doGossipToUnreachableMember(); Gossiper.doGossipToSeed()
  64. gms.GossipDigestSynVerbHandler.doVerb(): Gossiper.examineGossiper(); new GossipDigestAck(); MessagingService.sendOneWay()
  65. gms.GossipDigestAckVerbHandler.doVerb(): Gossiper.notifyFailureDetector(); Gossiper.applyStateLocally(); Gossiper.makeGossipDigestAck2Message()
  66. gms.GossipDigestAck2VerbHandler.doVerb(): Gossiper.notifyFailureDetector(); Gossiper.applyStateLocally()
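
The gossip exchange relies on comparing the (generation, version) pair carried by a heartbeat to decide which side holds newer state. A minimal Java sketch of that comparison follows. HeartBeatSketch and isNewerThan are hypothetical names, and the ordering rule (generation first, then version) is an assumption about how such a pair is normally compared, not a quote of the Gossiper code.

```java
// Hypothetical sketch of ordering two heartbeat states by (generation, version).
class HeartBeatSketch {
    final int generation;  // assumed to change when a node restarts
    final int version;     // assumed to increase while the node is running

    HeartBeatSketch(int generation, int version) {
        this.generation = generation;
        this.version = version;
    }

    // True when this state should replace the other: a higher generation wins,
    // and within the same generation the higher version wins.
    boolean isNewerThan(HeartBeatSketch other) {
        if (generation != other.generation) {
            return generation > other.generation;
        }
        return version > other.version;
    }

    public static void main(String[] args) {
        HeartBeatSketch local = new HeartBeatSketch(1365000000, 42);
        HeartBeatSketch remote = new HeartBeatSketch(1365000000, 57);
        System.out.println("request update from remote: " + remote.isNewerThan(local));
    }
}
```
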
  67. Architecture: API Layer, Dynamo Layer, Database Layer
  68. Database Layer: o.a.c.concurrent, o.a.c.db, o.a.c.cache, o.a.c.io, o.a.c.trace
  69. o.a.c.concurrent.StageManager: stages = new EnumMap<Stage, ThreadPoolExecutor>(Stage.class); getStage(Stage stage)
  70. o.a.c.concurrent.Stage: READ, MUTATION, GOSSIP, REQUEST_RESPONSE, ANTI_ENTROPY (And more...)
  71. Database Layer: o.a.c.concurrent, o.a.c.db, o.a.c.cache, o.a.c.io, o.a.c.trace
  72. o.a.c.db.Table: // Keyspace; open(String table); getColumnFamilyStore(String cfName); getRow(QueryFilter filter); apply(RowMutation mutation, boolean writeCommitLog)
  73. o.a.c.db.ColumnFamilyStore: // Column Family; getColumnFamily(QueryFilter filter); getTopLevelColumns(...); apply(DecoratedKey key, ColumnFamily columnFamily, SecondaryIndexManager.Updater indexer)
  74. o.a.c.db.IColumnContainer: addColumn(IColumn column); remove(ByteBuffer columnName). ColumnFamily, SuperColumn.
  75. o.a.c.db.ISortedColumns: addColumn(IColumn column, Allocator allocator); removeColumn(ByteBuffer name). ArrayBackedSortedColumns, AtomicSortedColumns, TreeMapBackedSortedColumns.
  76. o.a.c.db.Memtable: put(DecoratedKey key, ColumnFamily columnFamily, SecondaryIndexManager.Updater indexer); flushAndSignal(CountDownLatch latch, Future<ReplayPosition> context)
  77. Memtable.FlushRunnable.writeSortedContents(): // SSTableWriter; createFlushWriter(); // Iterate through rows & CFs in order; writer.append()
  78. o.a.c.db.ReadCommand: getRow(Table table). SliceByNamesReadCommand, SliceFromReadCommand.
  79. o.a.c.db.IDiskAtomFilter: getMemtableColumnIterator(...); getSSTableColumnIterator(...). IdentityQueryFilter, NamesQueryFilter, SliceQueryFilter.
  80. Some query performance...
  81. Today: Write Path, Read Path
  82. memtable_flush_queue_size test... m1.xlarge Cassandra node; m1.xlarge client node; 1 CF with 6 Secondary Indexes; 1 Client Thread; 10,000 Inserts; 100 Columns per Row; 1100 bytes per Column
  83. CF write latency and memtable_flush_queue_size... [Chart: latency in microseconds (y-axis 0 to 1,200) at the 85th, 95th, 99th and 100th percentiles. Series: memtable_flush_queue_size=7; memtable_flush_queue_size=1.]
  84. Request latency and memtable_flush_queue_size... [Chart: latency in microseconds (y-axis 0 to 5,000,000) at the 85th, 95th, 99th and 100th percentiles. Series: memtable_flush_queue_size=7; memtable_flush_queue_size=1.]
  85. durable_writes test... 10,000 Inserts; 50 Columns per Row; 50 bytes per Column
  86. Request latency and durable_writes (1 client)... [Chart: latency in microseconds (y-axis 0 to 7,000) at the 85th, 95th and 99th percentiles. Series: enabled; disabled.]
  87. Request latency and durable_writes (10 clients)... [Chart: latency in microseconds (y-axis 0 to 30,000) at the 85th, 95th and 99th percentiles. Series: enabled; disabled.]
  88. Request latency and durable_writes (20 clients)... [Chart: latency in microseconds (y-axis 0 to 90,000) at the 85th, 95th and 99th percentiles. Series: enabled; disabled.]
  89. CommitLog tests... 10,000 Inserts; 50 Columns per Row; 50 bytes per Column
  90. The periodic commit log adds the mutation to a queue, then acknowledges. The commit log is appended to by a single thread; sync is called every commitlog_sync_period_in_ms.
  91. Request latency and commitlog_sync_period_in_ms... [Chart: latency in microseconds (y-axis 170 to 220) at the 85th, 95th and 99th percentiles. Series: 10,000 ms; 10 ms.]
  92. The batch commit log adds the mutation to a queue and waits before acknowledging. The writer thread processes mutations for the commitlog_sync_batch_window_in_ms duration, then syncs, then signals.
  93. Request latency comparing periodic and batch sync... [Chart: latency in microseconds (y-axis 0 to 800) at the 85th, 95th and 99th percentiles. Series: periodic; batch.]
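
The difference between the two commit log modes is when the write is acknowledged relative to the sync. The following single-threaded Java sketch shows only that ordering. CommitLogSketch, SyncMode and durableCount are hypothetical names; the real commit log uses a writer thread and timers, which are replaced here by an explicit sync() call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical, single-threaded sketch of periodic versus batch acknowledgement.
class CommitLogSketch {
    enum SyncMode { PERIODIC, BATCH }

    private final SyncMode mode;
    private final List<String> unsynced = new ArrayList<>();  // appended but not yet synced
    private final List<String> durable = new ArrayList<>();   // synced to "disk"
    private final List<CompletableFuture<Void>> waiting = new ArrayList<>();

    CommitLogSketch(SyncMode mode) {
        this.mode = mode;
    }

    // Queue a mutation and return a future that completes when the write is acknowledged.
    CompletableFuture<Void> add(String mutation) {
        unsynced.add(mutation);
        if (mode == SyncMode.PERIODIC) {
            // Periodic: acknowledge immediately; the sync happens later on a timer.
            return CompletableFuture.completedFuture(null);
        }
        // Batch: the caller waits until the next sync has completed.
        CompletableFuture<Void> ack = new CompletableFuture<>();
        waiting.add(ack);
        return ack;
    }

    // Stand-in for the timer-driven or batch-window-driven sync.
    void sync() {
        durable.addAll(unsynced);
        unsynced.clear();
        for (CompletableFuture<Void> ack : waiting) {
            ack.complete(null);
        }
        waiting.clear();
    }

    int durableCount() { return durable.size(); }

    public static void main(String[] args) {
        CommitLogSketch batch = new CommitLogSketch(SyncMode.BATCH);
        CompletableFuture<Void> ack = batch.add("insert row 1");
        System.out.println("acknowledged before sync: " + ack.isDone());
        batch.sync();
        System.out.println("acknowledged after sync: " + ack.isDone());
    }
}
```

In periodic mode a mutation can be acknowledged while it is still in the unsynced list, which is the durability window traded for lower request latency.
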
  94. Merge mutation... Row level Isolation provided via SnapTree. (https://github.com/nbronson/snaptree)
  95. Row concurrency tests... 10,000 Columns per Row; 50 bytes per Column; 50 Columns per Insert
  96. CF Write Latency and row concurrency (10 clients)... [Chart: latency in microseconds (y-axis 0 to 2,000) at the 85th, 95th and 99th percentiles. Series: different rows; single row.]
  97. Secondary Indexes... synchronized access to indexed rows. (Keyspace wide)
  98. Index concurrency tests... CF with 2 Indexes; 10,000 Inserts; 6 Columns per Row; 35 bytes per Column; Alternating column values
  99. Request latency and index concurrency (10 clients)... [Chart: latency in microseconds (y-axis 0 to 4,000) at the 85th, 95th and 99th percentiles. Series: different rows; single row.]
  100. Index tests... 10,000 Inserts; 50 Columns per Row; 50 bytes per Column
  101. Request latency and secondary indexes... [Chart: latency in microseconds (y-axis 0 to 3,000) at the 85th, 95th and 99th percentiles. Series: no indexes; six indexes.]
  102. Today: Write Path, Read Path
  103. bloom_filter_fp_chance tests... 1,000,000 Rows; 50 Columns per Row; 50 bytes per Column; commitlog_total_space_in_mb: 1; Read random 10% of rows.
  104. CF read latency and bloom_filter_fp_chance... [Chart: latency in microseconds (y-axis 0 to 7,000) at the 85th, 95th and 99th percentiles. Series: default (0.000744); 0.1.]
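
For a sense of what the two bloom_filter_fp_chance values cost in memory, the textbook Bloom filter sizing formula gives the bits needed per key for a target false positive chance p: bits per key = -ln(p) / (ln 2)^2. The Java sketch below applies that general formula; it is not taken from Cassandra's code, which may size its filters differently, and BloomSizingSketch is a hypothetical name.

```java
// Hypothetical sketch applying the textbook Bloom filter sizing formula.
class BloomSizingSketch {
    // Bits per key for an optimally configured filter: -ln(p) / (ln 2)^2.
    static double bitsPerKey(double fpChance) {
        double ln2 = Math.log(2.0);
        return -Math.log(fpChance) / (ln2 * ln2);
    }

    // Optimal number of hash functions for a given bits-per-key budget: bits * ln 2, rounded.
    static long optimalHashCount(double bitsPerKey) {
        return Math.max(1L, Math.round(bitsPerKey * Math.log(2.0)));
    }

    public static void main(String[] args) {
        for (double p : new double[] {0.000744, 0.1}) {
            double bits = bitsPerKey(p);
            System.out.printf("fp chance %.6f -> %.2f bits per key, %d hash functions%n",
                    p, bits, optimalHashCount(bits));
        }
    }
}
```

By this formula the default of 0.000744 works out to roughly 15 bits per key, against under 5 bits per key for 0.1, so the higher setting uses about a third of the memory at the price of more unnecessary SSTable reads.
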
  105. key_cache_size_in_mb tests... 10,000 Rows; 50 Columns per Row; 50 bytes per Column; Read all Rows
  106. CF read latency and key_cache_size_in_mb... [Chart: latency in microseconds (y-axis 0 to 300) at the 85th, 95th and 99th percentiles. Series: default (100MB), 100% Hit Rate; disabled.]
  107. index_interval tests... 100,000 Rows; 50 Columns per Row; 50 bytes per Column; key_cache_size_in_mb: 0; Read 1 Column from random 10% of Rows
  108. CF read latency and index_interval... [Chart: latency in microseconds (y-axis 0 to 20,000) at the 85th, 95th and 99th percentiles. Series: index_interval=128 (default); index_interval=512.]
  109. row_cache_size_in_mb tests... 100,000 Rows; 50 Columns per Row; 50 bytes per Column; Read all Rows
  110. CF read latency and row_cache_size_in_mb... [Chart: latency in microseconds (y-axis 0 to 260) at the 85th, 95th and 99th percentiles. Series: row_cache_size_in_mb=0 and key_cache_size_in_mb=100mb; row_cache_size_in_mb=100mb and key_cache_size_in_mb=0.]
  111. Column Index tests... Read first Column by name from 1,200 Columns. Read first Column by name from 1,000,000 Columns.
  112. CF read latency and Column Index... [Chart: latency in microseconds (y-axis 0 to 6,000) at the 85th, 95th and 99th percentiles. Series: First Column from 1,200; First Column from 1,000,000.]
  113. Name Locality tests... 1,000,000 Columns; 50 bytes per Column. Read 100 Columns from middle of row. Read 100 Columns spread across row.
  114. CF read latency and name locality... [Chart: latency in microseconds (y-axis 0 to 200,000) at the 85th, 95th and 99th percentiles. Series: Adjacent Columns; Spread Columns.]
  115. Start position tests... 1,000,000 Columns; 50 bytes per Column. Read first 100 Columns without start. Read first 100 Columns with start.
  116. CF read latency and start position... [Chart: latency in microseconds (y-axis 0 to 40,000) at the 85th, 95th and 99th percentiles. Series: Without start position; With start position.]
  117. Start offset tests... 1,000,000 Columns; 50 bytes per Column. Read first 100 Columns with start. Read middle 100 Columns with start.
  118. CF read latency and start offset... [Chart: latency in microseconds (y-axis 0 to 40,000) at the 85th, 95th and 99th percentiles. Series: First; Middle.]
  119. Start offset tests... 1,000,000 Columns; 50 bytes per Column. Read first 100 Columns without start. Read last 100 Columns with reversed.
  120. CF read latency and reversed... [Chart: latency in microseconds (y-axis 0 to 40,000) at the 85th, 95th and 99th percentiles. Series: Forward; Reversed.]
  121. Thanks.
  122. Aaron Morton, @aaronmorton, www.thelastpickle.com. Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License.