ApacheCon NA 2013 - Cassandra Internals

Talk from ApacheCon North America 2013 on Cassandra Internals by Aaron Morton.

  1. 1. APACHECON NORTH AMERICA 2013 CASSANDRA INTERNALS Aaron Morton @aaronmorton www.thelastpickle.com Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License
  2. 2. About Me Freelance Cassandra Consultant Based in Wellington, New Zealand Apache Cassandra Committer DataStax MVP for Apache Cassandra
  3. 3. Architecture Code
  4. 4. Cassandra Architecture Clients APIs Cluster Aware Cluster Unaware Disk
  5. 5. Cassandra Cluster Architecture Clients APIs APIs Cluster Aware Cluster Aware Cluster Unaware Cluster Unaware Disk Disk Node 1 Node 2
  6. 6. Dynamo Cluster Architecture Clients APIs APIs Dynamo Dynamo Database Database Disk Disk Node 1 Node 2
  7. 7. Architecture API Dynamo Database
  8. 8. API Transports Thrift Native Binary Read Line RMI
  9. 9. Thrift Transport //Custom TServer implementations o.a.c.thrift.CustomTThreadPoolServer o.a.c.thrift.CustomTNonBlockingServer o.a.c.thrift.CustomTHsHaServer
  10. 10. API Transports Thrift Native Binary Read Line RMI
  11. 11. Native Binary Transport Beta in Cassandra 1.2 Uses Netty 3.5 Enabled with start_native_transport (Disabled by default)
  12. 12. o.a.c.transport.Server.run() //Setup the Netty server new ExecutionHandler() new NioServerSocketChannelFactory() ServerBootstrap.setPipelineFactory()
  13. 13. o.a.c.transport.Message.Dispatcher.messageReceived() //Process message from client ServerConnection.validateNewMessage() Request.execute() ServerConnection.applyStateTransition() Channel.write()
  14. 14. o.a.c.transport.messages CredentialsMessage() EventMessage() ExecuteMessage() PrepareMessage() QueryMessage() ResultMessage() (And more...)
  15. 15. Messages Defined in the Native Binary Protocol $SRC/doc/native_protocol.spec
  16. 16. API Services JMX CLI Thrift CQL 3
  17. 17. JMX Management Beans Spread around the code base. Interfaces named *MBean
  18. 18. JMX Management Beans Registered with the names such as org.apache.cassandra.db: type=StorageProxy
  19. 19. API Services JMX CLI Thrift CQL 3
  20. 20. o.a.c.cli.CliMain.main() // Connect to server to read input this.connect() this.evaluateFileStatements() this.processStatementInteractive()
  21. 21. CLI Grammar ANTLR Grammar $SRC/src/java/o/a/c/cli/CLI.g
  22. 22. o.a.c.cli.CliClient.executeCLIStatement() // Process statement CliCompiler.compileQuery() #ANTLR switch (tree.getType()) case...
  23. 23. API Services JMX CLI Thrift CQL 3
  24. 24. o.a.c.thrift.CassandraServer // Implements Thrift Interface // Access control // Input validation // Mapping to/from Thrift and internal types
  25. 25. Thrift Interface Thrift IDL $SRC/interface/cassandra.thrift
  26. 26. o.a.c.thrift.CassandraServer.get_slice() // get columns for one row Tracing.begin() ClientState cState = state() cState.hasColumnFamilyAccess() multigetSliceInternal()
  27. 27. CassandraServer.multigetSliceInternal() // get columns for many rows ThriftValidation.validate*() // Create ReadCommands getSlice()
  28. 28. CassandraServer.getSlice() // Process ReadCommands // return Thrift types readColumnFamily() thriftifyColumnFamily()
  29. 29. CassandraServer.readColumnFamily() // Process ReadCommands // Return ColumnFamilies StorageProxy.read()
  30. 30. API Services JMX CLI Thrift CQL 3
  31. 31. o.a.c.cql3.QueryProcessor // Prepares and executes CQL3 statements // Used by Thrift & Native transports // Access control // Input validation // Returns transport.ResultMessage
  32. 32. CQL3 Grammar ANTLR Grammar $SRC/o.a.c.cql3/Cql.g
  33. 33. o.a.c.cql3.statements.ParsedStatement // Subclasses generated by ANTLR // Tracks bound term count // Prepare CQLStatement prepare()
  34. 34. o.a.c.cql3.statements.CQLStatement checkAccess(ClientState state) validate(ClientState state) execute(ConsistencyLevel cl, QueryState state, List<ByteBuffer> variables)
  35. 35. o.a.c.cql3.functions.Function argsType() returnType() execute(List<ByteBuffer> parameters)
  36. 36. statements.SelectStatement.RawStatement // Implements ParsedStatement // Input validation prepare()
  37. 37. statements.SelectStatement.execute() // Create ReadCommands StorageProxy.read()
  38. 38. Architecture API Dynamo Database
  39. 39. Dynamo Layer o.a.c.service o.a.c.net o.a.c.dht o.a.c.locator o.a.c.gms o.a.c.stream
  40. 40. o.a.c.service.StorageProxy // Cluster wide storage operations // Select endpoints & check CL available // Send messages to Stages // Wait for response // Store Hints
  41. 41. o.a.c.service.StorageService // Ring operations // Track ring state // Start & stop ring membership // Node & token queries
  42. 42. o.a.c.service.IResponseResolver preprocess(MessageIn<T> message) resolve() throws DigestMismatchException RowDigestResolver RowDataResolver RangeSliceResponseResolver
  43. 43. Response Handlers / Callback implements IAsyncCallback<T> response(MessageIn<T> msg)
  44. 44. o.a.c.service.ReadCallback.get() //Wait for blockfor & data condition.await(timeout, TimeUnit.MILLISECONDS) throw ReadTimeoutException() resolver.resolve()
  45. 45. o.a.c.service.StorageProxy.fetchRows() getLiveSortedEndpoints() new RowDigestResolver() new ReadCallback() MessagingService.sendRR() --------------------------------------- ReadCallback.get() # blocking catch (DigestMismatchException ex) catch (ReadTimeoutException ex)
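
The blocking wait in slides 44 and 45 (send the reads, then block in ReadCallback.get() until enough replicas answer or the timeout fires) can be sketched roughly as follows. This is a minimal illustration, not Cassandra's code: ReadCallbackSketch, ReadTimeoutSketch and the String payloads are hypothetical names, the timeout is raised as an unchecked exception for brevity, and the resolver step (digest comparison) is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Unchecked stand-in for a read timeout (an assumption of this sketch). */
final class ReadTimeoutSketch extends RuntimeException {
    ReadTimeoutSketch(String message) {
        super(message);
    }
}

/** Hypothetical, simplified read callback; not Cassandra's ReadCallback. */
final class ReadCallbackSketch<T> {
    private final int blockFor;
    private final CountDownLatch latch;
    private final List<T> responses = new CopyOnWriteArrayList<>();

    ReadCallbackSketch(int blockFor) {
        this.blockFor = blockFor;
        this.latch = new CountDownLatch(blockFor);
    }

    /** Called once per replica response, typically from a messaging thread. */
    void response(T message) {
        responses.add(message);
        latch.countDown();
    }

    /** Blocks until blockFor responses have arrived or the timeout elapses. */
    List<T> get(long timeoutMillis) {
        try {
            if (!latch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new ReadTimeoutSketch(
                        "received " + responses.size() + " of " + blockFor + " responses");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new ReadTimeoutSketch("interrupted while waiting for responses");
        }
        // A real implementation would hand the responses to a resolver here.
        return new ArrayList<>(responses);
    }

    public static void main(String[] args) {
        ReadCallbackSketch<String> callback = new ReadCallbackSketch<>(2);
        callback.response("data from replica A");
        callback.response("digest from replica B");
        System.out.println(callback.get(100));
    }
}
```

The latch count plays the role of "blockfor": the caller is released as soon as that many replicas have replied, regardless of how many requests were sent.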
  46. 46. Dynamo Layer o.a.c.service o.a.c.net o.a.c.dht o.a.c.locator o.a.c.gms o.a.c.stream
  47. 47. o.a.c.net.MessagingService.verb<<enum>> MUTATION READ REQUEST_RESPONSE TREE_REQUEST TREE_RESPONSE (And more...)
  48. 48. o.a.c.net.MessagingService.verbHandlers new EnumMap<Verb, IVerbHandler>(Verb.class)
  49. 49. o.a.c.net.IVerbHandler<T> doVerb(MessageIn<T> message, String id);
  50. 50. o.a.c.net.MessagingService.verbStages new EnumMap<MessagingService.Verb, Stage>(MessagingService.Verb.class)
  51. 51. o.a.c.net.MessagingService.receive() runnable = new MessageDeliveryTask( message, id, timestamp); StageManager.getStage( message.getMessageType()); stage.execute(runnable);
  52. 52. o.a.c.net.MessageDeliveryTask.run() // If droppable and rpc_timeout MessagingService.incrementDroppedMessages(verb); MessagingService.getVerbHandler(verb) verbHandler.doVerb(message, id)
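
Slides 47 to 52 describe a dispatch pattern: each verb maps to a handler and to a stage, and an incoming message is wrapped in a task that runs on that stage's thread pool, unless it has already exceeded the timeout, in which case it is counted as dropped. A minimal sketch of that pattern follows. All names here (VerbDispatchSketch, VerbHandler, the String payload, the reduced enums) are hypothetical, and the sketch assumes every verb is droppable, which the slides do not state.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of verb-to-stage dispatch; not Cassandra's MessagingService. */
final class VerbDispatchSketch {
    enum Verb { MUTATION, READ, REQUEST_RESPONSE }
    enum Stage { MUTATION, READ, REQUEST_RESPONSE }

    interface VerbHandler {
        void doVerb(String payload, String id);
    }

    private final Map<Verb, VerbHandler> verbHandlers = new EnumMap<>(Verb.class);
    private final Map<Verb, Stage> verbStages = new EnumMap<>(Verb.class);
    private final Map<Stage, ExecutorService> stages = new EnumMap<>(Stage.class);
    private final AtomicInteger dropped = new AtomicInteger();
    private final long timeoutMillis;

    VerbDispatchSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
        for (Stage stage : Stage.values()) {
            stages.put(stage, Executors.newSingleThreadExecutor());
        }
        verbStages.put(Verb.MUTATION, Stage.MUTATION);
        verbStages.put(Verb.READ, Stage.READ);
        verbStages.put(Verb.REQUEST_RESPONSE, Stage.REQUEST_RESPONSE);
    }

    void registerVerbHandler(Verb verb, VerbHandler handler) {
        verbHandlers.put(verb, handler);
    }

    /** Wraps the message in a task and runs it on the stage mapped to its verb. */
    void receive(Verb verb, String payload, String id, long createdAtMillis) {
        ExecutorService stage = stages.get(verbStages.get(verb));
        stage.execute(() -> {
            // Assumption: all verbs are droppable once older than the timeout.
            if (System.currentTimeMillis() - createdAtMillis > timeoutMillis) {
                dropped.incrementAndGet();
                return;
            }
            VerbHandler handler = verbHandlers.get(verb);
            if (handler != null) {
                handler.doVerb(payload, id);
            }
        });
    }

    int droppedMessages() {
        return dropped.get();
    }

    /** Stops accepting work and waits for queued tasks to finish. */
    void shutdownAndAwait() {
        for (ExecutorService stage : stages.values()) {
            stage.shutdown();
        }
        try {
            for (ExecutorService stage : stages.values()) {
                stage.awaitTermination(5, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        VerbDispatchSketch messaging = new VerbDispatchSketch(10_000);
        messaging.registerVerbHandler(Verb.READ,
                (payload, id) -> System.out.println("READ " + id + ": " + payload));
        messaging.receive(Verb.READ, "row key 42", "msg-1", System.currentTimeMillis());
        messaging.receive(Verb.READ, "stale request", "msg-2", 0L);
        messaging.shutdownAndAwait();
        System.out.println("dropped: " + messaging.droppedMessages());
    }
}
```

Keeping the verb-to-stage mapping separate from the verb-to-handler mapping lets several verbs share one thread pool while still having distinct handlers.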
  53. 53. Dynamo Layer o.a.c.service o.a.c.net o.a.c.dht o.a.c.locator o.a.c.gms o.a.c.stream
  54. 54. o.a.c.dht.IPartitioner<T extends Token> getToken(ByteBuffer key) getRandomToken() LocalPartitioner RandomPartitioner Murmur3Partitioner
  55. 55. o.a.c.dht.Token<T> compareTo(Token<T> o) BytesToken BigIntegerToken LongToken
  56. 56. Dynamo Layer o.a.c.service o.a.c.net o.a.c.dht o.a.c.locator o.a.c.gms o.a.c.stream
  57. 57. o.a.c.locator.IEndpointSnitch getRack(InetAddress endpoint) getDatacenter(InetAddress endpoint) sortByProximity(InetAddress address, List<InetAddress> addresses) SimpleSnitch PropertyFileSnitch Ec2MultiRegionSnitch
  58. 58. o.a.c.locator.AbstractReplicationStrategy getNaturalEndpoints( RingPosition searchPosition) calculateNaturalEndpoints(Token searchToken, TokenMetadata tokenMetadata) SimpleStrategy NetworkTopologyStrategy
  59. 59. o.a.c.locator.TokenMetadata BiMultiValMap<Token, InetAddress> tokenToEndpointMap BiMultiValMap<Token, InetAddress> bootstrapTokens Set<InetAddress> leavingEndpoints
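
Slides 54 to 59 fit together as: a partitioner turns a key into a token, the token metadata maps tokens to endpoints, and a replication strategy walks the ring from the search token to pick replicas. The sketch below illustrates only the ring walk, in the spirit of a simple (topology-unaware) strategy. RingSketch, the long-valued tokens and the String endpoints are hypothetical; the token is taken as an input rather than computed by a partitioner, and racks and datacenters are ignored.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical token ring; not Cassandra's TokenMetadata or replication strategy. */
final class RingSketch {
    // Sorted token -> endpoint map; one endpoint may own several tokens.
    private final TreeMap<Long, String> tokenToEndpoint = new TreeMap<>();

    void addToken(long token, String endpoint) {
        tokenToEndpoint.put(token, endpoint);
    }

    /**
     * Starts at the first token greater than or equal to searchToken, walks the
     * ring clockwise (wrapping around), and collects distinct endpoints until
     * replicationFactor of them are found or the ring is exhausted.
     */
    List<String> naturalEndpoints(long searchToken, int replicationFactor) {
        Set<String> endpoints = new LinkedHashSet<>();
        for (String endpoint : tokenToEndpoint.tailMap(searchToken, true).values()) {
            if (endpoints.size() < replicationFactor) {
                endpoints.add(endpoint);
            }
        }
        // Wrap around to the start of the ring.
        for (String endpoint : tokenToEndpoint.headMap(searchToken, false).values()) {
            if (endpoints.size() < replicationFactor) {
                endpoints.add(endpoint);
            }
        }
        return new ArrayList<>(endpoints);
    }

    public static void main(String[] args) {
        RingSketch ring = new RingSketch();
        ring.addToken(0L, "10.0.0.1");
        ring.addToken(100L, "10.0.0.2");
        ring.addToken(200L, "10.0.0.3");
        System.out.println(ring.naturalEndpoints(150L, 2));
    }
}
```

A topology-aware strategy would apply the same walk but skip endpoints according to rack and datacenter information from the snitch.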
  60. 60. Dynamo Layer o.a.c.service o.a.c.net o.a.c.dht o.a.c.locator o.a.c.gms o.a.c.stream
  61. 61. o.a.c.gms.VersionedValue // VersionGenerator.getNextVersion() public final int version; public final String value;
  62. 62. o.a.c.gms.ApplicationState<<enum>> STATUS LOAD SCHEMA DC RACK (And more...)
  63. 63. o.a.c.gms.HeartBeatState //VersionGenerator.getNextVersion(); private int generation; private int version;
  64. 64. o.a.c.gms.Gossiper.GossipTask.run() // SYN -> ACK -> ACK2 makeRandomGossipDigest() new GossipDigestSyn() // Use MessagingService.sendOneWay() Gossiper.doGossipToLiveMember() Gossiper.doGossipToUnreachableMember() Gossiper.doGossipToSeed()
  65. 65. gms.GossipDigestSynVerbHandler.doVerb() Gossiper.examineGossiper() new GossipDigestAck() MessagingService.sendOneWay()
  66. 66. gms.GossipDigestAck2VerbHandler.doVerb() Gossiper.notifyFailureDetector() Gossiper.applyStateLocally()
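
The gossip slides (61 to 66) revolve around versioned state: a heartbeat carries a generation and a version, and state received from a peer is applied locally only when it is newer. The sketch below illustrates that comparison under a stated assumption, namely that a higher generation always wins and the version breaks ties within a generation. GossipMergeSketch and HeartBeat are hypothetical names, and the SYN/ACK/ACK2 exchange and the failure detector are not modelled.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of applying remote gossip state; not Cassandra's Gossiper. */
final class GossipMergeSketch {

    /** Simplified heartbeat: generation changes on restart, version on each update. */
    static final class HeartBeat {
        final int generation;
        final int version;

        HeartBeat(int generation, int version) {
            this.generation = generation;
            this.version = version;
        }

        // Assumption: higher generation wins, version breaks ties.
        boolean isNewerThan(HeartBeat other) {
            return generation > other.generation
                    || (generation == other.generation && version > other.version);
        }
    }

    private final Map<String, HeartBeat> endpointState = new HashMap<>();

    /** Stores the remote state if it is unknown or newer; returns true if applied. */
    boolean applyStateLocally(String endpoint, HeartBeat remote) {
        HeartBeat local = endpointState.get(endpoint);
        if (local == null || remote.isNewerThan(local)) {
            endpointState.put(endpoint, remote);
            return true;
        }
        return false;
    }

    HeartBeat stateFor(String endpoint) {
        return endpointState.get(endpoint);
    }

    public static void main(String[] args) {
        GossipMergeSketch gossip = new GossipMergeSketch();
        System.out.println(gossip.applyStateLocally("10.0.0.2", new HeartBeat(1, 5)));
        System.out.println(gossip.applyStateLocally("10.0.0.2", new HeartBeat(1, 3)));
        System.out.println(gossip.applyStateLocally("10.0.0.2", new HeartBeat(2, 1)));
    }
}
```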
  67. 67. Architecture API Layer Dynamo Layer Database Layer
  68. 68. Database Layer o.a.c.concurrent o.a.c.db o.a.c.cache o.a.c.io o.a.c.trace
  69. 69. o.a.c.concurrent.StageManager stages = new EnumMap<Stage, ThreadPoolExecutor>(Stage.class); getStage(Stage stage)
  70. 70. o.a.c.concurrent.Stage READ MUTATION GOSSIP REQUEST_RESPONSE ANTI_ENTROPY (And more...)
  71. 71. Database Layer o.a.c.concurrent o.a.c.db o.a.c.cache o.a.c.io o.a.c.trace
  72. 72. o.a.c.db.Table // Keyspace open(String table) getColumnFamilyStore(String cfName) getRow(QueryFilter filter) apply(RowMutation mutation, boolean writeCommitLog)
  73. 73. o.a.c.db.ColumnFamilyStore // Column Family getColumnFamily(QueryFilter filter) getTopLevelColumns(...) apply(DecoratedKey key, ColumnFamily columnFamily, SecondaryIndexManager.Updater indexer)
  74. 74. o.a.c.db.IColumnContainer addColumn(IColumn column) remove(ByteBuffer columnName) ColumnFamily SuperColumn
  75. 75. o.a.c.db.ISortedColumns addColumn(IColumn column, Allocator allocator) removeColumn(ByteBuffer name) ArrayBackedSortedColumns AtomicSortedColumns TreeMapBackedSortedColumns
  76. 76. o.a.c.db.Memtable put(DecoratedKey key, ColumnFamily columnFamily, SecondaryIndexManager.Updater indexer) flushAndSignal(CountDownLatch latch, Future<ReplayPosition> context)
  77. 77. Memtable.FlushRunnable.writeSortedContents() // SSTableWriter createFlushWriter() // Iterate through rows & CF’s in order writer.append()
  78. 78. o.a.c.db.ReadCommand getRow(Table table) SliceByNamesReadCommand SliceFromReadCommand
  79. 79. o.a.c.db.IDiskAtomFilter getMemtableColumnIterator(...) getSSTableColumnIterator(...) IdentityQueryFilter NamesQueryFilter SliceQueryFilter
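
The Memtable slides above (76 and 77) describe two operations: put() adds columns for a row key to an in-memory structure, and the flush iterates through rows and columns in order while appending them to a writer. A minimal sketch of that idea follows. MemtableSketch and its String keys, column names and values are hypothetical, and rows are sorted by raw key here purely for illustration; the slides show a DecoratedKey, so the real ordering is an internal detail this sketch does not reproduce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical in-memory sorted table; not Cassandra's Memtable. */
final class MemtableSketch {
    // Row key -> (column name -> value); both levels are kept sorted.
    private final ConcurrentSkipListMap<String, ConcurrentSkipListMap<String, String>> rows =
            new ConcurrentSkipListMap<>();

    /** Adds or overwrites one column of one row. */
    void put(String key, String column, String value) {
        rows.computeIfAbsent(key, k -> new ConcurrentSkipListMap<>()).put(column, value);
    }

    /** Emits every column in row order, then column order, as a flush writer would. */
    List<String> writeSortedContents() {
        List<String> appended = new ArrayList<>();
        for (Map.Entry<String, ConcurrentSkipListMap<String, String>> row : rows.entrySet()) {
            for (Map.Entry<String, String> column : row.getValue().entrySet()) {
                appended.add(row.getKey() + "/" + column.getKey() + "=" + column.getValue());
            }
        }
        return appended;
    }

    public static void main(String[] args) {
        MemtableSketch memtable = new MemtableSketch();
        memtable.put("user:2", "name", "bob");
        memtable.put("user:1", "name", "alice");
        memtable.put("user:1", "email", "alice@example.com");
        System.out.println(memtable.writeSortedContents());
    }
}
```

Because writes arrive in arbitrary order but the structure stays sorted, the flush can write its output sequentially without a separate sort step.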
  80. 80. Thanks.
  81. 81. Aaron Morton @aaronmorton www.thelastpickle.com Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License
