Introduction to Apache Cassandra

Introduction to Apache Cassandra (September 2014).
Design principles, replication, consistency, clusters, CQL.

Published in: Data & Analytics

  1. Introduction to Apache Cassandra
  2. Me: Robert Stupp; Freelancer, Coder, Architect; @snazy, snazy@snazy.de; Contributor to Apache Cassandra, 3.0 UDFs (CASSANDRA-7395 + related); Databases, Network, Backend
  3. Agenda: Apache Cassandra History; Design Principles; Outstanding Differences; CQL Intro; Access C*; Clusters; Cassandra Future
  4. Apache Cassandra History
  5. Apache Cassandra started at Facebook, inspired by [images not captured in transcript]. Note: Facebook initially had two data centers.
  6. 2.1 released in Sep 2014
  7. Apache Cassandra Design Principles
  8. Hardware failures can and will occur! Cassandra handles failures: from a single node to a whole data center, from client to server.
  9. The complicated part of learning Cassandra is understanding Cassandra’s simplicity.
  10. Keep it simple: all nodes are equal; master-less architecture; no name nodes; no SPOF (single point of failure); no read before modify (prevents race conditions)
  11. Keep it running: no need to take the cluster down, e.g. during maintenance or during a software update. Rolling restart is your friend.
  12. Outstanding Differences
  13. Cassandra: highly scalable, runs on clusters from a few nodes up to 1000+ nodes! Linear scalability (proven!); multi-datacenter aware (world-wide!); no SPOF
  14. Cassandra @ Apple
  15. Linear Scalability
  16. Scaling Cassandra: More data? -> add more nodes. Faster access? -> add more nodes.
  17. Read / Write performance: reads are fast; writes are even faster
  18. Durability: writes are durable, period.
  19. Availability @ Netflix: Chaos Monkey kills nodes randomly
  20. Availability @ Netflix: Chaos Gorilla kills regions randomly
  21. Availability @ Netflix: Chaos Kong kills whole data centers
  22. Availability @ Netflix: http://de.slideshare.net/planetcassandra/active-active-c-behind-the-scenes-at-netflix
  23. 32 node cluster (Raspberry Pis) @DataStax
  24. Most outstanding: great documentation; many blog posts; many presentations; many videos; regular webinars; huge, active and healthy community
  25. Data Distribution
  26. DHT: data is organized in a “Distributed Hash Table” (hash over row key)
  27. DHT [diagram: nodes 0 to 7]
  28. Replication
  29. Replication Factor 2 [diagram: nodes 0 to 7, Row A, Row B]
  30. Replication Factor 3 [diagram: nodes 0 to 7, Row A, Row B]
  31. Consistency: consistency is defined per request; several consistency levels (CLs) for different needs
  32. Eventual consistency is not “hopefully consistent”: EC means there’s a time gap until updates are consistently readable
  33. Consistency Levels: ANY (only for writes); ONE, LOCAL_ONE, TWO, THREE (not recommended); ALL (not recommended); QUORUM, LOCAL_QUORUM, EACH_QUORUM; SERIAL, LOCAL_SERIAL
  34. Consistency: data is always replicated; the CL defines how many replicas must fulfill the request
  35. Write [diagram: nodes 0 to 7]
  36. Write [diagram: nodes 0 to 7]
  37. Multi DC setup [diagram: DC 1, DC 2]
  38. Multi DC replication [diagram: Write, DC 1, DC 2]
  39. Multi DC replication [diagram: Write, DC 1, DC 2]
  40. Multi DC replication [diagram: Write, DC 1, DC 2]
  41. Replication & Consistency: define the number of replicas using the replication factor; define the required consistency per request
  42. CQL Introduction: CQL = Cassandra Query Language
  43. “CQL is SQL minus joins, minus subqueries, plus collections” (plus user types, plus tuple types)
  44. Why CQL? Introduces a schema to Cassandra; familiar syntax; easy to understand; DML operations are atomic
  45. Data model (hierarchical view): Keyspace (schema); Table (column family); Row; partition key (part of primary key); static columns; clustering key (part of primary key); columns
  46. CQL / DDL: similar to SQL; CREATE TABLE …, ALTER TABLE …, DROP TABLE …
  47. CQL / DML: similar to SQL; INSERT …, UPDATE …, DELETE …, SELECT …
  48. CQL / BATCH: group related modifications (INSERT, UPDATE, DELETE); atomic operation
  49. CQL types: boolean, int (32 bit), bigint (64 bit), float, double, decimal ("BigDecimal"), varint ("BigInteger"), ascii, text (= varchar), blob, inet, timestamp, uuid, timeuuid
  50. CQL collection types: list<foo>, set<foo>, map<foo, bar>. Since C* 2.1, collections can contain any type, even other collections.
  51. CQL composite types: user types (C* 2.1) are composite types with named fields; tuple types (C* 2.1) are unstructured lists of values
  52. CQL / user types: CREATE TYPE address (street text, zip int, city text); CREATE TABLE users (username text, addresses map<text, frozen<address>>, ...
  53. Cassandra Data Modeling: access by key, no access by arbitrary WHERE clause; duplicate data (it’s OK!); aggregate data; build application-maintained indexes
  54. RDBMS modeling
  55. C* modeling
  56. Data Modeling with RDBMS: driven by "How can I store something right?" and "What answers do I have?"
  57. Data Modeling with NoSQL: driven by "How can I access something right?" and "What questions do I have?"
  58. Data Modeling Basics: work top-down. Think about: What does the application do? What are the access patterns? Now design the data model.
  59. Data Modeling: http://de.slideshare.net/planetcassandra/cassandra-day-sv-2014-fundamentals-of-apache-cassandra-data-modeling http://de.slideshare.net/planetcassandra/data-modeling-with-travis-price
  60. Accessing Cassandra
  61. Command Line: cqlsh (CQL shell); nodetool (node/cluster administration)
  62. GUI: DevCenter, a visual query tool
  63. Stress test? Cassandra 2.1 comes with an improved stress tool: simulates read+write workloads; uses configurable data; works against older C* versions, too
  64. DataStax APLv2 open source drivers: for Java, for Python, for C#, for Scala / Spark; https://github.com/datastax/ or http://www.datastax.com/download
  65. Native protocol: C*’s own network protocol for clients; request multiplexing; schema change notifications; cluster change notifications
  66. Third Party Drivers: for a huge number of languages
  67. Mappers: high-level mappers exist at least for Java. Special case: Scala, due to its strong+complex type model (DataStax OSS Spark driver)
  68. Spark + Hadoop: yes, works really well. Note: Spark is about 100x faster
  69. Clusters
  70. Cluster sizes: C* works with a few nodes; C* works with several hundred / thousand nodes
  71. Cluster setup: configure for multiple data centers; plan for multi-DC setup :)
  72. Cluster experience: remember, a single Cassandra cluster works over multiple data centers all over the world. “Disaster-proven”: hurricanes, Amazon DC outages
  73. Apache Cassandra Future
  74. Cassandra 3.0 (in development): User Defined Functions; aggregate functions; functional indexes; workload recording + playback; better SSTables; fully off-heap row cache; better serial consistency; indexes w/ high cardinality. Subject to change!!!
  75. Get active!
  76. Cassandra Community: http://cassandra.apache.org/ http://planetcassandra.org/ (Blog) http://www.slideshare.net/planetcassandra/presentations http://de.slideshare.net/DataStax/presentations
  77. Cassandra Community: https://www.youtube.com/user/PlanetCassandra https://www.youtube.com/user/DataStax http://www.datastax.com/dev/blog/ http://www.datastax.com/docs/ Users mailing list: user@cassandra.apache.org
  78. Free C* Training! http://planetcassandra.org/cassandra-training/
  79. Get involved! Ask questions, submit RFEs or share experiences on the user mailing list user@cassandra.apache.org. Answers arrive quickly!
  80. Live Demo: User Defined Functions
  81. C* 3.0 UDFs: users create functions using CREATE FUNCTION … LANGUAGE … AS …; Java, JavaScript, Scala, Groovy, JRuby, Jython; functions work on all nodes
  82. C* 3.0 UDFs example: CREATE FUNCTION sin(input double) RETURNS double LANGUAGE javascript AS 'Math.sin(input)'; (This is JavaScript!)
  83. UDFs for what? Own aggregation code, e.g. SELECT sum(value) FROM table WHERE …; Functional indexes, e.g. CREATE INDEX idx ON table ( myFunction(colname) ); Targeted for C* 3.0
  84. Thanks for your attention. Download Apache Cassandra at http://cassandra.apache.org/ Robert Stupp, @snazy, snazy@snazy.de, de.slideshare.net/RobertStupp
  85. Q & A
  87. BACKUP SLIDES: User-Defined-Functions Demo
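
The data distribution and replication slides (26 to 30) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Cassandra's implementation: the names `Ring` and `token_for`, the evenly spaced tokens, the 32-bit ring size and the use of MD5 are all chosen for illustration only, and the placement rule (the owning node plus the next nodes clockwise) is just one simple strategy.

```python
import bisect
import hashlib
from typing import List


def token_for(row_key: str, ring_size: int = 2**32) -> int:
    """Hash a row key onto the ring (MD5 is used here purely for illustration)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % ring_size


class Ring:
    """Toy hash ring: each node owns the token range that ends at its own token."""

    def __init__(self, node_count: int = 8, ring_size: int = 2**32):
        self.ring_size = ring_size
        # Evenly spaced tokens for nodes 0..node_count-1 (hypothetical layout).
        self.tokens = [
            (i + 1) * ring_size // node_count - 1 for i in range(node_count)
        ]

    def primary(self, row_key: str) -> int:
        """Index of the first node whose token is >= the key's token."""
        token = token_for(row_key, self.ring_size)
        return bisect.bisect_left(self.tokens, token) % len(self.tokens)

    def replicas(self, row_key: str, replication_factor: int) -> List[int]:
        """The owning node plus the next nodes clockwise on the ring."""
        start = self.primary(row_key)
        count = len(self.tokens)
        return [(start + i) % count for i in range(min(replication_factor, count))]


if __name__ == "__main__":
    ring = Ring()
    for key in ("Row A", "Row B"):
        print(key, "RF=2:", ring.replicas(key, 2), "RF=3:", ring.replicas(key, 3))
```

With eight nodes, raising the replication factor from 2 to 3 keeps the first two replicas of a row and adds the next node on the ring; which nodes are chosen depends only on the hash of the row key.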

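The consistency slides (31 to 34) state that the CL defines how many replicas must fulfill a request. The following is a minimal sketch of that arithmetic for a single data center, assuming QUORUM means a majority of the replicas (RF // 2 + 1). The function names `required_acks` and `read_sees_latest_write` are hypothetical, and ANY, the LOCAL_* and EACH_* levels and the SERIAL levels are deliberately left out.

```python
def required_acks(level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for the given consistency level."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        return fixed[level]
    if level == "QUORUM":
        # Majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"level not covered by this sketch: {level}")


def read_sees_latest_write(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when the read and write replica sets must share at least one node."""
    writes = required_acks(write_level, replication_factor)
    reads = required_acks(read_level, replication_factor)
    return writes + reads > replication_factor


if __name__ == "__main__":
    rf = 3
    for w, r in (("QUORUM", "QUORUM"), ("ONE", "ONE"), ("ALL", "ONE")):
        print(w, r, read_sees_latest_write(w, r, rf))
```

For a replication factor of 3, QUORUM writes combined with QUORUM reads overlap in at least one replica (2 + 2 > 3), while ONE combined with ONE does not (1 + 1 <= 3), which is where the time gap of eventual consistency shows up.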