Introduce Apache Cassandra - JavaTwo Taiwan, 2012


Transcript

  • 1. Boris Yen, Senior Engineer, Ruckus Wireless (U.S.)
  • 2. Expert Session B: A Brief Introduction to Apache Cassandra
  • 3. Outline
      • Cassandra vs SQL Server
      • Overview
      • Data in Cassandra
      • Data Partitioning
      • Data Replication
      • Data Consistency
      • Client Libraries
  • 4. Cassandra vs SQL Server
      • Cassandra
          o More servers = more capacity.
          o Scaling concerns are transparent to the application.
          o No single point of failure.
          o Horizontal scaling.
      • SQL Server
          o More powerful machine = more capacity.
          o Adding capacity requires manual labor from ops people and substantial downtime.
          o There is a limit to how big you can go.
          o Vertical scaling; growth follows Moore's law.
  • 5. Overview
      • Features derived from Dynamo and BigTable
      • Distributed
          o Data is partitioned among all nodes
      • Extremely scalable
          o Adding a new node = adding more capacity
          o Adding a new node is easy
      • Fault tolerant
          o All nodes are the same
          o Read/write anywhere
          o Automatic data replication
      • High performance
  • 6. Overview
      • http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-performance
      • http://www.cubrid.org/blog/dev-platform/nosql-benchmarking/
      • http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
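  • 7. Data in Cassandra
      • Keyspace ~ Database in RDBMS
      • Column Family ~ Table in RDBMS
      [Diagram: a Keyspace containing a ColumnFamily; the row with key Boris has the columns ID, Addr, and Phone (for example Addr: Taiwan, Phone: 09.....); a single column is shown as { column: Phone, value: 09..., timestamp: 1000 }]
      • The timestamp is used to resolve conflicts.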
  • 8. Data in Cassandra
      • Keyspace
          o Where the replication strategy and replication factor are defined.

          CREATE KEYSPACE keyspace_name
            WITH strategy_class = 'SimpleStrategy'
            AND strategy_options:replication_factor = 2;

      • ColumnFamily

          CREATE COLUMNFAMILY user (
            id uuid PRIMARY KEY,
            address text,
            userName text
          ) WITH comment=''
            AND comparator=text
            AND read_repair_chance=0.100000
            AND gc_grace_seconds=864000
            AND default_validation=text
            AND min_compaction_threshold=4
            AND max_compaction_threshold=32
            AND replicate_on_write=True
            AND compaction_strategy_class='SizeTieredCompactionStrategy'
            AND compression_parameters:sstable_compression='org.apache.cassandra.io.compress.SnappyCompressor';
  • 9. Data in Cassandra
      • Commit log
          o Captures write activity, which ensures data durability.
      • Memtable
          o Stores the most recent write activity.
      • SSTable
          o When a memtable is flushed to disk, it becomes an SSTable.
  • 10. Data Read/Write
      • Write
          o Data → Commit log → Memtable → (flushed) → SSTable
      • Read
          o Search the row cache. If it returns a result, return that result; no further action is needed.
          o If there is no hit in the row cache, read from the Memtable(s) and SSTable(s) that might contain the requested key, collate the results, and return them.
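The write and read paths on this slide can be sketched as a toy in-memory store. This is only an illustration, not Cassandra's implementation: the class name `TinyStore`, the `flushThreshold` parameter, and the use of whole string values instead of timestamped columns are all assumptions made for the example, and the row cache is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the write and read paths (hypothetical class, not Cassandra code). */
class TinyStore {
    private final List<String> commitLog = new ArrayList<>();             // append-only record of writes
    private TreeMap<String, String> memtable = new TreeMap<>();           // most recent writes, in memory
    private final List<Map<String, String>> sstables = new ArrayList<>(); // immutable tables, oldest first
    private final int flushThreshold;                                     // assumed flush trigger: entry count

    TinyStore(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void write(String key, String value) {
        commitLog.add(key + "=" + value); // 1. record the write for durability
        memtable.put(key, value);         // 2. apply it to the memtable
        if (memtable.size() >= flushThreshold) {
            flush();                      // 3. a full memtable becomes an SSTable
        }
    }

    private void flush() {
        sstables.add(Collections.unmodifiableMap(memtable));
        memtable = new TreeMap<>();
    }

    /** Checks the memtable first, then the SSTables from newest to oldest. */
    String read(String key) {
        String value = memtable.get(key);
        if (value != null) {
            return value;
        }
        for (int i = sstables.size() - 1; i >= 0; i--) {
            value = sstables.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    int sstableCount() {
        return sstables.size();
    }
}
```

With `new TinyStore(2)`, writing two keys triggers one flush, and a later write to one of those keys is served from the memtable while the other key is still read from the SSTable.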
  • 11. Data Compaction (t2 > t1)
      • sstable1
          Boris: { name: boris (t1), phone: 092xxx (t1), addr: tainan (t1) }
      • sstable2
          Boris: { name: boris.yen (t2), sex: male (t2), email: y@gmail (t2) }
      • sstableX (after compaction)
          Boris: { addr: tainan (t1), email: y@gmail (t2), name: boris.yen (t2), phone: 092xxx (t1), sex: male (t2) }
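The merge shown on this slide keeps, for each column, the version with the newest timestamp. A minimal sketch of that rule follows; the names `CompactionSketch`, `Cell`, and `compact` are made up for the example, and deletions (tombstones) are not modeled.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative per-column merge of one row across SSTables (hypothetical names). */
class CompactionSketch {

    /** A column value together with the timestamp of the write that produced it. */
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Merges several versions of the same row; for each column the newest timestamp wins. */
    static Map<String, Cell> compact(List<Map<String, Cell>> rowVersions) {
        Map<String, Cell> merged = new TreeMap<>();
        for (Map<String, Cell> version : rowVersions) {
            for (Map.Entry<String, Cell> entry : version.entrySet()) {
                Cell existing = merged.get(entry.getKey());
                if (existing == null || entry.getValue().timestamp > existing.timestamp) {
                    merged.put(entry.getKey(), entry.getValue());
                }
            }
        }
        return merged;
    }
}
```

Feeding it the two versions of row Boris from the slide yields five columns, with name taken from sstable2 (t2) and phone and addr taken from sstable1 (t1).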
  • 12. Data Partitioning
      • The total data managed by the cluster is represented as a circular space, or ring.
      • Before a node can join the ring, it must be assigned a token.
      • The token determines the node's position on the ring and the range of data it is responsible for.
      • Partitioning strategy
          o Random partitioning
              - Default and recommended
          o Ordered partitioning
              - Sequential writes can cause hot spots
              - More administrative overhead to load-balance the cluster
  • 13. Data Partitioning
      [Diagram: random partitioning on a ring of nodes with tokens t1 to t5; labels: hash(k1), hash(k2), hash(k3), hash(k4), Data: k1, Data: k3]
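How a token ring maps a key to a node can be sketched with a sorted map. The sketch assumes an MD5-based hash in the spirit of random partitioning and assigns each key to the first node whose token is greater than or equal to the key's hash, wrapping around at the end of the ring. The class `TokenRing` and its methods are hypothetical, and the token range is simplified compared with the real partitioner.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative token ring (hypothetical class; simplified token range). */
class TokenRing {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>(); // token -> node name

    void addNode(BigInteger token, String node) {
        ring.put(token, node);
    }

    /** MD5 of the key, read as a non-negative integer. */
    static BigInteger hash(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return new BigInteger(1, md5.digest(key.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** The owner is the first node at or after the key's token, wrapping to the first node. */
    String ownerOf(BigInteger keyToken) {
        Map.Entry<BigInteger, String> entry = ring.ceilingEntry(keyToken);
        if (entry == null) {
            entry = ring.firstEntry(); // wrap around the ring (null only if the ring is empty)
        }
        return entry.getValue();
    }

    String ownerOf(String key) {
        return ownerOf(hash(key));
    }
}
```

Because the hash spreads keys evenly over the token space, sequential keys land on different nodes, which is why random partitioning avoids the hot spots mentioned for ordered partitioning.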
  • 14. Data Replication
      • Replication ensures fault tolerance and no single point of failure.
      • Replication is controlled by two keyspace parameters: the replication factor and the replication strategy.
      • The replication factor controls how many copies of a row are stored in the cluster.
      • The replication strategy controls how the data is replicated.
  • 15. Data Replication
      [Diagram: random partitioning on a ring of nodes with tokens t1 to t5, RF=3; labels: hash(k1), Data: k1, coordinator]
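Replica placement for a simple, single-data-center strategy can be sketched as a clockwise walk around the ring: the first replica goes to the node that owns the key's token, and the remaining copies go to the next nodes on the ring. The class `ReplicaPlacement`, the `long` tokens, and the method name are assumptions for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Illustrative replica selection by walking the ring clockwise (hypothetical names). */
class ReplicaPlacement {

    /** Returns up to rf distinct nodes, starting at the owner of keyToken. */
    static List<String> replicasFor(long keyToken, TreeMap<Long, String> ring, int rf) {
        // Nodes at or after the key's token come first, then the walk wraps to the ring start.
        List<String> clockwise = new ArrayList<>(ring.tailMap(keyToken, true).values());
        clockwise.addAll(ring.headMap(keyToken, false).values());

        List<String> replicas = new ArrayList<>();
        for (String node : clockwise) {
            if (replicas.size() == rf) {
                break;
            }
            if (!replicas.contains(node)) {
                replicas.add(node);
            }
        }
        return replicas;
    }
}
```

On a five-node ring with RF=3, a key whose token falls just before the second node is stored on the second, third, and fourth nodes. Whichever node the client contacts acts as coordinator and forwards the request to those replicas.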
  • 16. Data Consistency
      • Cassandra supports tunable data consistency.
      • Choose between strong and eventual consistency, depending on the need.
      • Consistency can be set per operation, for both reads and writes.
      • Handles multi-data-center operations.
  • 17. Consistency Level
      • Write: Any, One, Quorum, Local_Quorum, Each_Quorum, All
      • Read: One, Quorum, Local_Quorum, Each_Quorum, All
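The trade-off between these levels comes down to simple arithmetic: a quorum is a majority of the replicas, and a read is guaranteed to overlap the latest write when the number of replicas read plus the number written exceeds the replication factor. The helper below is a sketch of that arithmetic only; the class and method names are invented for the example.

```java
/** Illustrative consistency arithmetic (hypothetical helper, not a Cassandra API). */
class ConsistencyMath {

    /** Quorum size for a replication factor: a majority of the replicas. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True when every read replica set must overlap every write replica set. */
    static boolean isStronglyConsistent(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }
}
```

With RF=3, Quorum means 2 replicas, so Quorum writes plus Quorum reads (2 + 2 > 3) give strong consistency, while One plus One (1 + 1 ≤ 3) gives eventual consistency.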
  • 18. Built-in Consistency Repair Features
      • Read Repair
      • Hinted Handoff
      • Anti-Entropy Node Repair
      http://www.datastax.com/docs/0.8/dml/data_consistency#builtin-consistency
  • 19. Client Library for Java
      • Hector
          o https://github.com/hector-client/hector.git
          o https://github.com/hector-client/hector/wiki/User-Guide
      • Astyanax
          o https://github.com/Netflix/astyanax.git
      • CQL + JDBC
          o http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/
  • 20. Hector
      • High-level, simple object-oriented interface to Cassandra
      • Failover behavior on the client side
      • Connection pooling for improved performance and scalability
      • Automatic retry of downed hosts...
  • 21. Hector

      // slice query (ko and se presumably denote the Keyspace and a StringSerializer)
      SliceQuery<String, String, String> q = HFactory.createSliceQuery(ko, se, se, se);
      q.setColumnFamily(cf).setKey("jsmith").setColumnNames("first", "last", "middle");
      Result<ColumnSlice<String, String>> r = q.execute();

      // multi-get: up to 3 columns from each of five rows
      MultigetSliceQuery<String, String, String> multigetSliceQuery =
          HFactory.createMultigetSliceQuery(keyspace, stringSerializer, stringSerializer, stringSerializer);
      multigetSliceQuery.setColumnFamily("Standard1");
      multigetSliceQuery.setKeys("fake_key_0", "fake_key_1", "fake_key_2", "fake_key_3", "fake_key_4");
      multigetSliceQuery.setRange("", "", false, 3);
      Result<Rows<String, String, String>> result = multigetSliceQuery.execute();

      // batch operation: three insertions sent in one mutation
      Mutator<String> mutator = HFactory.createMutator(keyspace, stringSerializer);
      mutator.addInsertion("jsmith", "Standard1", HFactory.createStringColumn("first", "John"))
             .addInsertion("jsmith", "Standard1", HFactory.createStringColumn("last", "Smith"))
             .addInsertion("jsmith", "Standard1", HFactory.createStringColumn("middle", "Q"));
      mutator.execute();

      https://github.com/hector-client/hector/wiki/User-Guide
  • 22. CQL+JDBC

      // Load the driver and connect to the system keyspace
      Class.forName("org.apache.cassandra.cql.jdbc.CassandraDriver");
      String URL = String.format("jdbc:cassandra://%s:%d/%s", HOST, PORT, "system");
      System.out.println("Connection URL = " + URL);
      con = DriverManager.getConnection(URL);
      Statement stmt = con.createStatement();

      // Create KeySpace
      String createKS = String.format(
          "CREATE KEYSPACE %s WITH strategy_class = SimpleStrategy AND strategy_options:replication_factor = 1;",
          KEYSPACE);
      stmt.execute(createKS);

      // Create the target Column family
      String createCF = "CREATE COLUMNFAMILY RegressionTest (keyname text PRIMARY KEY,"
          + "bValue boolean, "
          + "iValue int "
          + ") WITH comparator = ascii AND default_validation = bigint;";
      stmt.execute(createCF);

      https://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/browse/src/test/java/org/apache/cassandra/cql/jdbc/JdbcRegressionTest.java
  • 23. CQL+JDBC

      Statement statement = con.createStatement();

      // Empty the column family
      String truncate = "TRUNCATE RegressionTest;";
      statement.execute(truncate);

      // Insert two rows; the second leaves iValue unset
      String insert1 = "INSERT INTO RegressionTest (keyname,bValue,iValue) VALUES (key0,true,2000);";
      statement.executeUpdate(insert1);
      String insert2 = "INSERT INTO RegressionTest (keyname,bValue) VALUES (key1,false);";
      statement.executeUpdate(insert2);

      // Read everything back
      String select = "SELECT * from RegressionTest;";
      ResultSet result = statement.executeQuery(select);
      ResultSetMetaData metadata = result.getMetaData();
      ...

      https://code.google.com/a/apache-extras.org/p/cassandra-jdbc/source/browse/src/test/java/org/apache/cassandra/cql/jdbc/JdbcRegressionTest.java
  • 24. Useful Tools
      • cassandra-cli
          o <cassandra-dir>/bin
          o http://www.datastax.com/docs/1.0/dml/using_cli
      • cqlsh
          o <cassandra-dir>/bin
          o http://www.datastax.com/docs/1.0/references/cql/index
      • nodetool
          o <cassandra-dir>/bin
          o http://www.datastax.com/docs/1.0/references/nodetool
      • stress
          o <cassandra-dir>/tools/bin
          o http://www.datastax.com/docs/1.0/references/stress_java
  • 25. Useful Tools
      • OpsCenter
          o http://www.datastax.com/products/opscenter
      • sstableloader
          o <cassandra-dir>/bin
          o http://www.datastax.com/dev/blog/bulk-loading
      • More tools
          o http://en.wikipedia.org/wiki/Apache_Cassandra#Tools_for_Cassandra
  • 26. Questions?