Apache Cassandra in Bangalore - Cassandra Internals and Performance
 

Slides from http://www.meetup.com/Apache-Cassandra/events/108524582/

Presentation Transcript

    • BANGALORE CASSANDRA UG, APRIL 2013: CASSANDRA INTERNALS & PERFORMANCE. Aaron Morton, @aaronmorton, www.thelastpickle.com. Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License.
    • Architecture Code
    • Cassandra Architecture Clients APIs Cluster Aware Cluster Unaware Disk
    • Cassandra Cluster Architecture Clients APIs APIs Cluster Aware Cluster Aware Cluster Unaware Cluster Unaware Disk Disk Node 1 Node 2
    • Dynamo Cluster Architecture Clients APIs APIs Dynamo Dynamo Database Database Disk Disk Node 1 Node 2
    • Architecture API Dynamo Database
    • API Transports Thrift Native Binary Read Line RMI
    • Thrift Transport //Custom TServer implementations o.a.c.thrift.CustomTThreadPoolServer o.a.c.thrift.CustomTNonBlockingServer o.a.c.thrift.CustomTHsHaServer
    • API Transports Thrift Native Binary Read Line RMI
    • Native Binary Transport Beta in Cassandra 1.2 Uses Netty 3.5 Enabled with start_native_transport (Disabled by default)
    • o.a.c.transport.Server.run() //Setup the Netty server new ExecutionHandler() new NioServerSocketChannelFactory() ServerBootstrap.setPipelineFactory()
    • o.a.c.transport.Message.Dispatcher.messageReceived() //Process message from client ServerConnection.validateNewMessage() Request.execute() ServerConnection.applyStateTransition() Channel.write()
    • o.a.c.transport.messages CredentialsMessage() EventMessage() ExecuteMessage() PrepareMessage() QueryMessage() ResultMessage() (And more...)
    • Messages Defined in the Native Binary Protocol $SRC/doc/native_protocol.spec
    • API Services JMX CLI Thrift CQL 3
    • JMX Management Beans Spread around the code base. Interfaces named *MBean
    • JMX Management Beans Registered with the names such as org.apache.cassandra.db: type=StorageProxy
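Cassandra's management beans are registered with the platform MBeanServer under "domain:type=Name" names like the one above, and clients (nodetool, jconsole) look them up by ObjectName. A minimal local sketch of the same pattern, using only the JDK; the DemoStats bean, its Stats class, and the ReadCount attribute are invented for illustration, not Cassandra classes:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

public class DemoStats {
    // Management interface; Cassandra names these *MBean
    public interface StatsMBean {
        long getReadCount();
    }

    public static class Stats implements StatsMBean {
        public long getReadCount() { return 42; }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Same "domain:type=Name" scheme as org.apache.cassandra.db:type=StorageProxy
        ObjectName name = new ObjectName("demo.db:type=Stats");
        server.registerMBean(new StandardMBean(new Stats(), StatsMBean.class), name);
        // A JMX client reads attributes by ObjectName and attribute name
        System.out.println(server.getAttribute(name, "ReadCount")); // prints 42
    }
}
```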
    • API Services JMX CLI Thrift CQL 3
    • o.a.c.cli.CliMain.main() // Connect to server to read input this.connect() this.evaluateFileStatements() this.processStatementInteractive()
    • CLI Grammar ANTLR Grammar $SRC/src/java/o/a/c/cli/CLI.g
    • o.a.c.cli.CliClient.executeCLIStatement() // Process statement CliCompiler.compileQuery() #ANTLR switch (tree.getType()) case...
    • API Services JMX CLI Thrift CQL 3
    • o.a.c.thrift.CassandraServer // Implements Thrift Interface // Access control // Input validation // Mapping to/from Thrift and internal types
    • Thrift Interface: Thrift IDL, $SRC/interface/cassandra.thrift
    • o.a.c.thrift.CassandraServer.get_slice() // get columns for one row Tracing.begin() ClientState cState = state() cState.hasColumnFamilyAccess() multigetSliceInternal()
    • CassandraServer.multigetSliceInternal() // get columns for many rows ThriftValidation.validate*() // Create ReadCommands getSlice()
    • CassandraServer.getSlice() // Process ReadCommands // return Thrift types readColumnFamily() thriftifyColumnFamily()
    • CassandraServer.readColumnFamily() // Process ReadCommands // Return ColumnFamilies StorageProxy.read()
    • API Services JMX CLI Thrift CQL 3
    • o.a.c.cql3.QueryProcessor // Prepares and executes CQL3 statements // Used by Thrift & Native transports // Access control // Input validation // Returns transport.ResultMessage
    • CQL3 Grammar ANTLR Grammar $SRC/o.a.c.cql3/Cql.g
    • o.a.c.cql3.statements.ParsedStatement // Subclasses generated by ANTLR // Tracks bound term count // Prepare CQLStatement prepare()
    • o.a.c.cql3.statements.CQLStatement checkAccess(ClientState state) validate(ClientState state) execute(ConsistencyLevel cl, QueryState state, List<ByteBuffer> variables)
    • o.a.c.cql3.functions.Function argsType() returnType() execute(List<ByteBuffer> parameters)
    • statements.SelectStatement.RawStatement // Implements ParsedStatement // Input validation prepare()
    • statements.SelectStatement.execute() // Create ReadCommands StorageProxy.read()
    • Architecture API Dynamo Database
    • Dynamo Layer o.a.c.service o.a.c.net o.a.c.dht o.a.c.locator o.a.c.gms o.a.c.stream
    • o.a.c.service.StorageProxy // Cluster wide storage operations // Select endpoints & check CL available // Send messages to Stages // Wait for response // Store Hints
    • o.a.c.service.StorageService // Ring operations // Track ring state // Start & stop ring membership // Node & token queries
    • o.a.c.service.IResponseResolver preprocess(MessageIn<T> message) resolve() throws DigestMismatchException RowDigestResolver RowDataResolver RangeSliceResponseResolver
    • Response Handlers / Callback implements IAsyncCallback<T> response(MessageIn<T> msg)
    • o.a.c.service.ReadCallback.get() //Wait for blockfor & data condition.await(timeout, TimeUnit.MILLISECONDS) throw ReadTimeoutException() resolver.resolve()
    • o.a.c.service.StorageProxy.fetchRows() getLiveSortedEndpoints() new RowDigestResolver() new ReadCallback() MessagingService.sendRR() --------------------------------------- ReadCallback.get() # blocking catch (DigestMismatchException ex) catch (ReadTimeoutException ex)
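The fetchRows() flow above sends a data read to one replica and digest reads to the others, then raises DigestMismatchException if they disagree, which triggers a full data read and read repair. A toy sketch of that comparison, with plain strings standing in for rows and hashCode() standing in for a real digest:

```java
import java.util.List;

// Toy version of the digest-read pattern in StorageProxy.fetchRows():
// the first replica's response is the data read, the rest are digests.
public class DigestRead {
    static String read(List<String> replicaRows) {
        String data = replicaRows.get(0);       // data read
        int expected = data.hashCode();         // its digest
        for (int i = 1; i < replicaRows.size(); i++) {
            int digest = replicaRows.get(i).hashCode(); // digest reads
            if (digest != expected)
                // Real code catches DigestMismatchException and retries
                // with full data reads, repairing stale replicas
                throw new IllegalStateException("DigestMismatch: repair needed");
        }
        return data;
    }

    public static void main(String[] args) {
        System.out.println(read(List.of("v1", "v1", "v1"))); // all replicas agree
        try {
            read(List.of("v1", "v2"));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```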
    • Dynamo Layer o.a.c.service o.a.c.net o.a.c.dht o.a.c.locator o.a.c.gms o.a.c.stream
    • o.a.c.net.MessagingService.Verb<<enum>> MUTATION READ REQUEST_RESPONSE TREE_REQUEST TREE_RESPONSE (And more...)
    • o.a.c.net.MessagingService.verbHandlers new EnumMap<Verb, IVerbHandler>(Verb.class)
    • o.a.c.net.IVerbHandler<T> doVerb(MessageIn<T> message, String id);
    • o.a.c.net.MessagingService.verbStages new EnumMap<MessagingService.Verb, Stage>(MessagingService.Verb.class)
    • o.a.c.net.MessagingService.receive() runnable = new MessageDeliveryTask( message, id, timestamp); StageManager.getStage( message.getMessageType()); stage.execute(runnable);
    • o.a.c.net.MessageDeliveryTask.run() // If dropable and rpc_timeout MessagingService.incrementDroppedMessages(verb); MessagingService.getVerbHandler(verb) verbHandler.doVerb(message, id)
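The receive()/MessageDeliveryTask flow above boils down to an EnumMap lookup from verb to handler. A self-contained sketch of that dispatch (the enum is trimmed and the handler behavior is invented; real handlers take a MessageIn and run on a Stage):

```java
import java.util.EnumMap;
import java.util.Map;

// Sketch of MessagingService-style verb dispatch via an EnumMap,
// mirroring the verbHandlers map shown above.
public class VerbDispatch {
    enum Verb { MUTATION, READ, REQUEST_RESPONSE }

    interface VerbHandler {
        String doVerb(String payload);
    }

    static final Map<Verb, VerbHandler> handlers = new EnumMap<>(Verb.class);
    static {
        handlers.put(Verb.READ, payload -> "read:" + payload);
        handlers.put(Verb.MUTATION, payload -> "applied:" + payload);
    }

    // Analogue of receive(): look up the handler for the message's
    // verb and run it (Cassandra wraps this in a MessageDeliveryTask
    // and hands it to the verb's Stage).
    static String receive(Verb verb, String payload) {
        VerbHandler h = handlers.get(verb);
        if (h == null) throw new IllegalStateException("no handler for " + verb);
        return h.doVerb(payload);
    }

    public static void main(String[] args) {
        System.out.println(receive(Verb.READ, "key1")); // read:key1
    }
}
```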
    • Dynamo Layer o.a.c.service o.a.c.net o.a.c.dht o.a.c.locator o.a.c.gms o.a.c.stream
    • o.a.c.dht.IPartitioner<T extends Token> getToken(ByteBuffer key) getRandomToken() LocalPartitioner RandomPartitioner Murmur3Partitioner
    • o.a.c.dht.Token<T> compareTo(Token<T> o) BytesToken BigIntegerToken LongToken
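A partitioner's contract is just a deterministic mapping from row key to token, so every node independently agrees where a row lives. A toy stand-in for getToken(ByteBuffer) using FNV-1a; Cassandra's real partitioners use Murmur3 or MD5:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Toy partitioner in the spirit of IPartitioner.getToken(ByteBuffer):
// FNV-1a stands in for Murmur3/MD5 purely for illustration.
public class ToyPartitioner {
    static long getToken(ByteBuffer key) {
        long h = 0xcbf29ce484222325L;      // FNV offset basis
        while (key.hasRemaining()) {
            h ^= (key.get() & 0xff);
            h *= 0x100000001b3L;           // FNV prime
        }
        return h;
    }

    public static void main(String[] args) {
        ByteBuffer k = ByteBuffer.wrap("user42".getBytes(StandardCharsets.UTF_8));
        // The same key always yields the same token, so placement
        // needs no coordination between nodes.
        System.out.println(getToken(k.duplicate()) == getToken(k.duplicate())); // true
    }
}
```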
    • Dynamo Layer o.a.c.service o.a.c.net o.a.c.dht o.a.c.locator o.a.c.gms o.a.c.stream
    • o.a.c.locator.IEndpointSnitch getRack(InetAddress endpoint) getDatacenter(InetAddress endpoint) sortByProximity(InetAddress address, List<InetAddress> addresses) SimpleSnitch PropertyFileSnitch Ec2MultiRegionSnitch
    • o.a.c.locator.AbstractReplicationStrategy getNaturalEndpoints( RingPosition searchPosition) calculateNaturalEndpoints(Token searchToken, TokenMetadata tokenMetadata) SimpleStrategy NetworkTopologyStrategy
    • o.a.c.locator.TokenMetadata BiMultiValMap<Token, InetAddress> tokenToEndpointMap BiMultiValMap<Token, InetAddress> bootstrapTokens Set<InetAddress> leavingEndpoints
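Given the token-to-endpoint map, SimpleStrategy-style replica selection walks the ring clockwise from the search token and takes the first RF distinct endpoints. A sketch under those assumptions (node names are invented; the real calculateNaturalEndpoints also consults the snitch for NetworkTopologyStrategy):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;

// Sketch of SimpleStrategy-style natural endpoint calculation over a
// sorted token ring, analogous to TokenMetadata's tokenToEndpointMap.
public class ToyReplicationStrategy {
    static List<String> naturalEndpoints(long token, TreeMap<Long, String> ring, int rf) {
        List<String> endpoints = new ArrayList<>();
        if (ring.isEmpty()) return endpoints;
        // Start at the first token >= the search token
        Iterator<String> it = ring.tailMap(token).values().iterator();
        int steps = 0;
        while (endpoints.size() < rf && steps++ < 2 * ring.size()) {
            if (!it.hasNext()) it = ring.values().iterator(); // wrap around the ring
            String node = it.next();
            if (!endpoints.contains(node)) endpoints.add(node); // RF distinct nodes
        }
        return endpoints;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> ring = new TreeMap<>();
        ring.put(0L, "node1");
        ring.put(100L, "node2");
        ring.put(200L, "node3");
        // Token 150 falls between node2 and node3: node3 owns it, and
        // node1 (next clockwise) holds the second replica.
        System.out.println(naturalEndpoints(150L, ring, 2)); // [node3, node1]
    }
}
```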
    • Dynamo Layer o.a.c.service o.a.c.net o.a.c.dht o.a.c.locator o.a.c.gms o.a.c.stream
    • o.a.c.gms.VersionedValue // VersionGenerator.getNextVersion() public final int version; public final String value;
    • o.a.c.gms.ApplicationState<<enum>> STATUS LOAD SCHEMA DC RACK (And more...)
    • o.a.c.gms.HeartBeatState //VersionGenerator.getNextVersion(); private int generation; private int version;
    • o.a.c.gms.Gossiper.GossipTask.run() // SYN -> ACK -> ACK2 makeRandomGossipDigest() new GossipDigestSyn() // Use MessagingService.sendOneWay() Gossiper.doGossipToLiveMember() Gossiper.doGossipToUnreachableMember() Gossiper.doGossipToSeed()
    • gms.GossipDigestSynVerbHandler.doVerb() Gossiper.examineGossiper() new GossipDigestAck() MessagingService.sendOneWay()
    • gms.GossipDigestAckVerbHandler.doVerb() Gossiper.notifyFailureDetector() Gossiper.applyStateLocally() Gossiper.makeGossipDigestAck2Message()
    • gms.GossipDigestAck2VerbHandler.doVerb() Gossiper.notifyFailureDetector() Gossiper.applyStateLocally()
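The SYN/ACK/ACK2 round hinges on digest comparison: for each endpoint, whichever side holds the higher version sends its state to the other. A simplified sketch of that examineGossiper()-style comparison, with plain integer versions standing in for HeartBeatState (endpoint addresses are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of gossip digest examination: decide which endpoint states
// we should send (we are ahead) and which we should request (we lag).
public class GossipDigestDemo {
    static Map<String, List<String>> examine(Map<String, Integer> mine,
                                             Map<String, Integer> theirs) {
        List<String> send = new ArrayList<>();    // states we are ahead on
        List<String> request = new ArrayList<>(); // states we lag on
        for (Map.Entry<String, Integer> e : theirs.entrySet()) {
            int local = mine.getOrDefault(e.getKey(), -1);
            if (local > e.getValue()) send.add(e.getKey());
            else if (local < e.getValue()) request.add(e.getKey());
        }
        Map<String, List<String>> result = new HashMap<>();
        result.put("send", send);       // goes out in the ACK
        result.put("request", request); // asked for in the ACK, filled by ACK2
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> mine = Map.of("10.0.0.1", 5, "10.0.0.2", 3);
        Map<String, Integer> theirs = Map.of("10.0.0.1", 4, "10.0.0.2", 7);
        System.out.println(examine(mine, theirs));
    }
}
```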
    • Architecture: API Layer, Dynamo Layer, Database Layer
    • Database Layer o.a.c.concurrent o.a.c.db o.a.c.cache o.a.c.io o.a.c.trace
    • o.a.c.concurrent.StageManager stages = new EnumMap<Stage, ThreadPoolExecutor>(Stage.class); getStage(Stage stage)
    • o.a.c.concurrent.Stage READ MUTATION GOSSIP REQUEST_RESPONSE ANTI_ENTROPY (And more...)
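StageManager's pattern is one bounded thread pool per Stage, looked up via an EnumMap, so READ work cannot starve GOSSIP and vice versa. Sketched here with plain java.util.concurrent executors (the pool size is arbitrary, and the real stages are instrumented ThreadPoolExecutors):

```java
import java.util.EnumMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of StageManager: an EnumMap from Stage to its executor.
public class Stages {
    enum Stage { READ, MUTATION, GOSSIP }

    static final EnumMap<Stage, ExecutorService> stages = new EnumMap<>(Stage.class);
    static {
        for (Stage s : Stage.values())
            stages.put(s, Executors.newFixedThreadPool(2)); // size is illustrative
    }

    // Analogue of StageManager.getStage(stage).execute(task)
    static Future<?> execute(Stage stage, Runnable task) {
        return stages.get(stage).submit(task);
    }

    public static void main(String[] args) throws Exception {
        Future<?> f = execute(Stage.READ, () -> System.out.println("running on READ stage"));
        f.get();
        stages.values().forEach(ExecutorService::shutdown);
    }
}
```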
    • Database Layer o.a.c.concurrent o.a.c.db o.a.c.cache o.a.c.io o.a.c.trace
    • o.a.c.db.Table // Keyspace open(String table) getColumnFamilyStore(String cfName) getRow(QueryFilter filter) apply(RowMutation mutation, boolean writeCommitLog)
    • o.a.c.db.ColumnFamilyStore // Column Family getColumnFamily(QueryFilter filter) getTopLevelColumns(...) apply(DecoratedKey key, ColumnFamily columnFamily, SecondaryIndexManager.Updater indexer)
    • o.a.c.db.IColumnContainer addColumn(IColumn column) remove(ByteBuffer columnName) ColumnFamily SuperColumn
    • o.a.c.db.ISortedColumns addColumn(IColumn column, Allocator allocator) removeColumn(ByteBuffer name) ArrayBackedSortedColumns AtomicSortedColumns TreeMapBackedSortedColumns
    • o.a.c.db.Memtable put(DecoratedKey key, ColumnFamily columnFamily, SecondaryIndexManager.Updater indexer) flushAndSignal(CountDownLatch latch, Future<ReplayPosition> context)
    • Memtable.FlushRunnable.writeSortedContents() // SSTableWriter createFlushWriter() // Iterate through rows & CF’s in order writer.append()
    • o.a.c.db.ReadCommand getRow(Table table) SliceByNamesReadCommand SliceFromReadCommand
    • o.a.c.db.IDiskAtomFilter getMemtableColumnIterator(...) getSSTableColumnIterator(...) IdentityQueryFilter NamesQueryFilter SliceQueryFilter
    • Some query performance...
    • Today: Write Path, Read Path
    • memtable_flush_queue_size test... m1.xlarge Cassandra node m1.xlarge client node 1 CF with 6 Secondary Indexes 1 Client Thread 10,000 Inserts, 100 Columns per Row 1100 bytes per Column
    • [Chart] CF write latency and memtable_flush_queue_size: latency in microseconds at the 85th, 95th, 99th and 100th percentiles, comparing memtable_flush_queue_size=7 vs memtable_flush_queue_size=1.
    • [Chart] Request latency and memtable_flush_queue_size: latency in microseconds at the 85th, 95th, 99th and 100th percentiles, comparing memtable_flush_queue_size=7 vs memtable_flush_queue_size=1.
    • durable_writes test... 10,000 Inserts, 50 Columns per Row 50 bytes per Column
    • [Chart] Request latency and durable_writes (1 client): latency in microseconds at the 85th, 95th and 99th percentiles, enabled vs disabled.
    • [Chart] Request latency and durable_writes (10 clients): latency in microseconds at the 85th, 95th and 99th percentiles, enabled vs disabled.
    • [Chart] Request latency and durable_writes (20 clients): latency in microseconds at the 85th, 95th and 99th percentiles, enabled vs disabled.
    • CommitLog tests... 10,000 Inserts, 50 Columns per Row 50 bytes per Column
    • periodic commit log adds the mutation to a queue, then acknowledges. The Commit Log is appended to by a single thread; sync is called every commitlog_sync_period_in_ms.
    • [Chart] Request latency and commitlog_sync_period_in_ms: latency in microseconds at the 85th, 95th and 99th percentiles, comparing 10,000 ms vs 10 ms.
    • batch commit log adds the mutation to a queue and waits before acknowledging. The writer thread processes mutations for commitlog_sync_batch_window_in_ms, then syncs, then signals.
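The batch behavior described above can be sketched with a queue of futures: writers block until the single log thread has drained one window's worth of mutations and synced once. The class and method names are invented, and the actual segment append/fsync is elided:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of batch commit log sync: many writers, one fsync per window.
public class BatchCommitLog {
    private final BlockingQueue<CompletableFuture<Void>> pending = new LinkedBlockingQueue<>();
    private final long batchWindowMs;

    BatchCommitLog(long batchWindowMs) { this.batchWindowMs = batchWindowMs; }

    // Writer threads: the mutation is acknowledged only after a sync
    CompletableFuture<Void> add(byte[] mutation) {
        CompletableFuture<Void> acked = new CompletableFuture<>();
        pending.add(acked);
        return acked;
    }

    // Single log thread: collect up to one window's mutations,
    // sync once, then signal every waiting writer
    void syncOnce() throws InterruptedException {
        CompletableFuture<Void> first = pending.poll(batchWindowMs, TimeUnit.MILLISECONDS);
        if (first == null) return; // nothing arrived this window
        List<CompletableFuture<Void>> batch = new ArrayList<>();
        batch.add(first);
        pending.drainTo(batch);
        // append to the log segment + fsync would happen here, once per batch
        batch.forEach(f -> f.complete(null));
    }

    public static void main(String[] args) throws Exception {
        BatchCommitLog log = new BatchCommitLog(10);
        CompletableFuture<Void> ack = log.add("mutation".getBytes());
        log.syncOnce();
        System.out.println(ack.isDone()); // true
    }
}
```

Amortizing the fsync over the batch is what keeps batch mode usable: each writer waits at most one window, but the disk sees one sync per window instead of one per mutation.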
    • [Chart] Request latency comparing periodic and batch sync: latency in microseconds at the 85th, 95th and 99th percentiles.
    • Merge mutation... Row level Isolation provided via SnapTree. (https://github.com/nbronson/snaptree)
    • Row concurrency tests... 10,000 Columns per Row 50 bytes per Column 50 Columns per Insert
    • [Chart] CF write latency and row concurrency (10 clients): latency in microseconds at the 85th, 95th and 99th percentiles, different rows vs a single row.
    • Secondary Indexes... synchronized access to indexed rows. (Keyspace wide)
    • Index concurrency tests... CF with 2 Indexes 10,000 Inserts 6 Columns per Row 35 bytes per Column Alternating column values
    • [Chart] Request latency and index concurrency (10 clients): latency in microseconds at the 85th, 95th and 99th percentiles, different rows vs a single row.
    • Index tests... 10,000 Inserts 50 Columns per Row 50 bytes per Column
    • [Chart] Request latency and secondary indexes: latency in microseconds at the 85th, 95th and 99th percentiles, no indexes vs six indexes.
    • Today: Write Path, Read Path
    • bloom_filter_fp_chance tests... 1,000,000 Rows 50 Columns per Row 50 bytes per Column commitlog_total_space_in_mb: 1 Read random 10% of rows.
    • [Chart] CF read latency and bloom_filter_fp_chance: latency in microseconds at the 85th, 95th and 99th percentiles, default (0.000744) vs 0.1.
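The memory side of that trade-off follows standard Bloom filter math (this is textbook sizing, not Cassandra's exact implementation): for n keys and target false-positive chance p, the filter needs m = -n * ln(p) / (ln 2)^2 bits, i.e. a fixed number of bits per key that depends only on p. A quick calculation:

```java
// Standard Bloom filter sizing: bits per key as a function of the
// target false-positive chance p (independent of the key count).
public class BloomSizing {
    static double bitsPerKey(double p) {
        return -Math.log(p) / (Math.log(2) * Math.log(2));
    }

    public static void main(String[] args) {
        // Tightening p from 0.1 to ~0.0007 costs roughly 3x the memory
        // (~4.8 vs ~15.1 bits per key), but avoids many wasted SSTable
        // reads on the read path.
        System.out.printf("p=0.1    -> %.1f bits/key%n", bitsPerKey(0.1));
        System.out.printf("p=0.0007 -> %.1f bits/key%n", bitsPerKey(0.0007));
    }
}
```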
    • key_cache_size_in_mb tests... 10,000 Rows 50 Columns per Row 50 bytes per Column Read all Rows
    • [Chart] CF read latency and key_cache_size_in_mb: latency in microseconds at the 85th, 95th and 99th percentiles, default (100MB, 100% hit rate) vs disabled.
    • index_interval tests... 100,000 Rows 50 Columns per Row 50 bytes per Column key_cache_size_in_mb: 0 Read 1 Column from random 10% of Rows
    • [Chart] CF read latency and index_interval: latency in microseconds at the 85th, 95th and 99th percentiles, index_interval=128 (default) vs index_interval=512.
    • row_cache_size_in_mb tests... 100,000 Rows 50 Columns per Row 50 bytes per Column Read all Rows
    • [Chart] CF read latency and row_cache_size_in_mb: latency in microseconds at the 85th, 95th and 99th percentiles, row_cache_size_in_mb=0 with key_cache_size_in_mb=100 vs row_cache_size_in_mb=100 with key_cache_size_in_mb=0.
    • Column Index tests... Read first Column by name from 1,200 Columns. Read first Column by name from 1,000,000 Columns.
    • [Chart] CF read latency and Column Index: latency in microseconds at the 85th, 95th and 99th percentiles, first Column from 1,200 vs first Column from 1,000,000.
    • Name Locality tests... 1,000,000 Columns 50 bytes per Column Read 100 Columns from middle of row. Read 100 Columns from spread across row.
    • [Chart] CF read latency and name locality: latency in microseconds at the 85th, 95th and 99th percentiles, adjacent Columns vs spread Columns.
    • Start position tests... 1,000,000 Columns 50 bytes per Column Read first 100 Columns without start. Read first 100 Columns with start.
    • [Chart] CF read latency and start position: latency in microseconds at the 85th, 95th and 99th percentiles, without vs with a start position.
    • Start offset tests... 1,000,000 Columns 50 bytes per Column Read first 100 Columns with start. Read middle 100 Columns with start.
    • [Chart] CF read latency and start offset: latency in microseconds at the 85th, 95th and 99th percentiles, first vs middle Columns.
    • Start offset tests... 1,000,000 Columns 50 bytes per Column Read first 100 Columns without start. Read last 100 Columns with reversed.
    • [Chart] CF read latency and reversed: latency in microseconds at the 85th, 95th and 99th percentiles, forward vs reversed.
    • Thanks.
    • Aaron Morton, @aaronmorton, www.thelastpickle.com. Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License.