Toronto Jaspersoft meetup
- 1. Toronto Jaspersoft User Group
Move. Faster.
Patrick McFadin, Principal Solution Architect
@PatrickMcFadin
©2012 DataStax
- 2. About Me/Moi?
• Principal Solution Architect at DataStax, THE Cassandra company
• Cassandra user since 0.7
• Prior
- Chief Architect at Hobsons
- Started a software services company. Link-11
• Follow me here: @PatrickMcFadin
- 3. Who is DataStax?
• We employ most of the Cassandra committers
• 24/7 support
• Consulting
• DataStax Enterprise
- 4. And beer!
And cupcakes! (??)
- 5. Our Solution
DataStax Enterprise allows
you to focus on your Big Data
applications instead of battling
your underlying infrastructure:
•Velocity
•Volume
•Variety
•Complexity
•Distribution
- 7. Cassandra as real-time foundation
•Continuous availability
•Extreme scale
•Multi-datacenter support
•Cloud enablement
•Operational simplicity
- 8. Hadoop in the same system:
•Batch analytics
•Reduced data movement, fewer ETL operations
•No complex architectures
•Integrated Mahout, Sqoop, Hive, Pig, etc.
- 10. Can we just talk
about Cassandra
... and aliens.
- 11. Roots
Dynamo
BigTable
- 15. Core concepts Scaling
• Need more write throughput? - add nodes
• Need more read throughput? - add nodes
• Cassandra scales in a linear fashion
• Massive number of ops/sec
- 16. Core concepts Scaling
Source: Solving big data challenges for enterprise application performance management
Proceedings of the VLDB Endowment, Volume 5 Issue 12, August 2012, Pages 1724-1735
- 17. Core concepts CAP Theorem
• Partition tolerance - Nodes can’t see each other but cluster is still up
• Consistency - Eventual, but Cassandra will not lose your data.
• Availability - Max uptime for clients
Cassandra lives here ...and sometimes lives here
It’s your choice!
- 18. Core concepts Availability
Continuous Availability > High Availability
Your infrastructure will fail
...deal with it.
- 20. Data Model Basics Cluster
Cluster - Multiple Nodes acting together. Even over WAN.
Keyspace - Logical collection of Column Families. Stores
replication strategy.
Column Family (Table) - Stores rows of data
- 21. Data Model Basics Rows
• Unique in column family
• Hashed
• Randomly assigned to node*
• Indexed for speed
*You pick the partitioner. Please pick random. Please. Please. Please
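To make "hashed" and "randomly assigned to node" concrete, here is a minimal Python sketch of a hash-based partitioner. It is an illustration under stated assumptions, not Cassandra's implementation: the `Ring` class, the node names, and the evenly spaced tokens are all hypothetical, and the token is simply the MD5 digest read as an unsigned integer.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 128  # size of the MD5 output space used in this sketch


def token_for(row_key: str) -> int:
    """Hash a row key to a token: the MD5 digest read as an unsigned integer."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


class Ring:
    """Maps row keys to nodes; each node owns the token range ending at its token."""

    def __init__(self, node_tokens):
        # node_tokens: {node_name: token}; keep (token, node) pairs sorted by token.
        self._ring = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._ring]

    def node_for(self, row_key: str) -> str:
        # First node whose token is >= the key's token, wrapping past the end.
        idx = bisect.bisect_left(self._tokens, token_for(row_key))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical three-node cluster with evenly spaced tokens.
ring = Ring({
    "node-a": 0,
    "node-b": RING_SIZE // 3,
    "node-c": 2 * RING_SIZE // 3,
})
print(ring.node_for("pmcfadin"))
```

Because the hash spreads keys evenly over the token space, similar row keys usually land on different nodes, which is why a random partitioner avoids hot spots.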
- 22. Data Model Basics Columns
• Assigned to a row
• Column Name: 64k ByteArray
• Column Value: 2G ByteArray (!!)
• Timestamp of when set
• Optional: Expire TTL
• Dynamic
Row Column Name ...
Column Value
Timestamp
TTL
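The parts of a column listed above can be sketched as a small Python data structure. This is a simplified model, not Cassandra's storage code: the `Column` class and `reconcile` helper are hypothetical names, expiry is assumed to be measured from the column's own timestamp, and conflicting writes are assumed to resolve by "highest timestamp wins".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    name: bytes                        # column name (byte array)
    value: bytes                       # column value (byte array)
    timestamp: int                     # microseconds since epoch, set at write time
    ttl_seconds: Optional[int] = None  # optional time to live

    def is_live(self, now_micros: int) -> bool:
        """A column without a TTL never expires; otherwise it lives for ttl_seconds."""
        if self.ttl_seconds is None:
            return True
        return now_micros < self.timestamp + self.ttl_seconds * 1_000_000


def reconcile(a: Column, b: Column) -> Column:
    """Pick the winner of two writes to the same column: the newer timestamp."""
    return a if a.timestamp >= b.timestamp else b


old = Column(b"email", b"old@example.com", timestamp=1_000)
new = Column(b"email", b"new@example.com", timestamp=2_000)
print(reconcile(old, new).value)
```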
- 23. Data Model Basics Wide Rows
• How wide? 2 Billion columns!!!
• No schema needed
• Row key, many columns
• Add columns as needed per row
- 24. Data Model Basics Data Access
Thrift
• Cassandra's client API is built entirely on top of Thrift
• Provides for manipulation of Data Model and Data
• Almost all current clients implement this API
CQL
• Cassandra Query Language
• New binary driver as of 1.2
• Extends functionality beyond Thrift
- 25. Data Model Basics Data Access
More about CQL
• Rapidly evolving spec
- Version 1 since Cassandra 0.8
- Version 2 since Cassandra 1.0
- Version 3 since Cassandra 1.1
- Final cut in 1.2
• Offers more features than Thrift
• DataStax Drivers
- 26. Data Model Basics Fixed schema
• Similar to a RDBMS table. Fairly fixed columns
• This example: Row key = username and is unique
• Use secondary indexes on firstname and lastname for lookup
• Adding columns with Cassandra is super easy (no downtime)
CREATE TABLE users (
username varchar,
firstname varchar,
lastname varchar,
email varchar,
password varchar,
created_date timestamp,
PRIMARY KEY (username)
);
CREATE INDEX user_firstname ON users (firstname);
CREATE INDEX user_lastname ON users (lastname);
- 27. Data Model Basics One-to-many
• Videos have many comments
• Comments have many users
• Order is as inserted (Reversible if needed)
• Use getSlice() to pull some or all of the comments
CREATE TABLE comments (
videoid uuid,
username varchar,
comment_ts timestamp,
comment varchar,
PRIMARY KEY (videoid,username,comment_ts)
);
- 28. Data Model Basics One-to-many pt2
• Underlying storage model is still wide rows
• CQL presents as a table
• username and comment_ts are filterable
Wide row
Time ordered
SELECT comment
FROM comments
WHERE username = 'ctodd'
AND comment_ts > '2012-07-12 10:30:00';
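One way to picture "CQL presents as a table, storage is still a wide row" is the Python sketch below. It is a toy model, not how Cassandra stores data on disk: `CommentsTable` and its methods are hypothetical, each `videoid` is assumed to be one wide row, and the (`username`, `comment_ts`) pairs are kept sorted inside it so a filter becomes a contiguous slice. Note that the sketch scopes the lookup to a single `videoid`, which is an assumption added here.

```python
import bisect
from collections import defaultdict
from datetime import datetime


class CommentsTable:
    """Toy wide-row store: one sorted list of column keys per videoid."""

    def __init__(self):
        self._keys = defaultdict(list)  # videoid -> sorted [(username, comment_ts)]
        self._values = {}               # (videoid, username, comment_ts) -> comment

    def insert(self, videoid, username, comment_ts, comment):
        key = (username, comment_ts)
        if (videoid, *key) not in self._values:
            bisect.insort(self._keys[videoid], key)  # keep the row sorted
        self._values[(videoid, *key)] = comment      # insert or overwrite

    def comments_after(self, videoid, username, after_ts):
        """Slice one wide row: this user's comments strictly after after_ts."""
        keys = self._keys.get(videoid, [])
        start = bisect.bisect_right(keys, (username, after_ts))
        result = []
        for user, ts in keys[start:]:
            if user != username:
                break  # left this user's contiguous range
            result.append(self._values[(videoid, user, ts)])
        return result


table = CommentsTable()
table.insert("video-1", "ctodd", datetime(2012, 7, 12, 11, 0), "Nice video")
print(table.comments_after("video-1", "ctodd", datetime(2012, 7, 12, 10, 30)))
```

Because the keys are sorted, the time filter never scans the whole row; it jumps to the start of the range and stops at the first key that belongs to another user.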
- 29. Data Model Basics Query Tables
• No joins in Cassandra
• Filtering and scans can be expensive
• Tag is unique regardless of video
• Great for “List videos with X tag”
• Tags have to be updated in Video and Tag at the same time
• Index integrity is maintained in app logic
Powerful performance tool!
CREATE TABLE tag_index (
tag varchar,
videoid varchar,
timestamp timestamp,
PRIMARY KEY (tag, videoid)
);
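"Index integrity is maintained in app logic" means the application writes to the video and to the tag index together. The Python sketch below shows the idea with in-memory dictionaries standing in for the two tables; `VideoStore`, `set_tags`, and `videos_with_tag` are hypothetical names, and a real application would issue the corresponding writes to Cassandra instead.

```python
class VideoStore:
    """Keeps a video's tags and the tag -> videos index in step."""

    def __init__(self):
        self.videos = {}     # videoid -> {"tags": set of tags}
        self.tag_index = {}  # tag -> {videoid: timestamp}

    def set_tags(self, videoid, tags, now):
        video = self.videos.setdefault(videoid, {"tags": set()})
        old_tags, new_tags = video["tags"], set(tags)

        # Remove index entries for tags the video no longer has.
        for tag in old_tags - new_tags:
            entries = self.tag_index.get(tag, {})
            entries.pop(videoid, None)
            if not entries:
                self.tag_index.pop(tag, None)

        # Add index entries for newly applied tags.
        for tag in new_tags - old_tags:
            self.tag_index.setdefault(tag, {})[videoid] = now

        video["tags"] = new_tags

    def videos_with_tag(self, tag):
        """The "list videos with X tag" query: a single lookup by tag."""
        return sorted(self.tag_index.get(tag, {}))


store = VideoStore()
store.set_tags("v1", ["cats", "funny"], now=1)
store.set_tags("v2", ["cats"], now=2)
print(store.videos_with_tag("cats"))
```

The read path is a single key lookup, which is the performance win; the cost is that every tag change has to touch both structures.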
- 30. Data Model Basics Loading data
> 1 Million rows
• BI Tools - Talend, Pentaho, JasperSoft
• Custom code - My personal favorite
• sstable loader - Only for specific file types
sstableloader -d 10.0.0.100 /home/pmcfadin/dbfiles
Requires files to be in sstable format
- 31. Data Model Basics Loading data
< 1 Million rows
• Everything that worked for 1 Million +
• CQL copy command
• Loads a delimited file into a table
COPY customers(Card_ID, Registration_Date, Gender, Birth_Date)
FROM 'Customers_File.txt'
WITH HEADER=true
AND DELIMITER=',';
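For the "custom code" route, the parsing half of a loader can be sketched in a few lines of Python. This is only an illustration of what the HEADER and DELIMITER options describe: `read_delimited` is a hypothetical helper, the sample values are made up, and the step that actually writes each row to Cassandra is left out.

```python
import csv
import io


def read_delimited(text_stream, columns, delimiter=",", header=True):
    """Yield one dict per record, mapping the given column names to field values."""
    reader = csv.reader(text_stream, delimiter=delimiter)
    if header:
        next(reader, None)  # skip the header line, like HEADER=true
    for record in reader:
        if not record:
            continue  # ignore blank lines
        if len(record) != len(columns):
            raise ValueError("expected %d fields, got %d" % (len(columns), len(record)))
        yield dict(zip(columns, record))


# Made-up sample data in the shape of the customers file above.
sample = io.StringIO(
    "Card_ID,Registration_Date,Gender,Birth_Date\n"
    "1001,2012-01-15,F,1980-03-02\n"
    "1002,2012-02-20,M,1975-11-30\n"
)
columns = ["Card_ID", "Registration_Date", "Gender", "Birth_Date"]
for row in read_delimited(sample, columns):
    print(row)
```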
- 32. Cassandra 1.2 Data Access
•Collections (maps, sets, lists)
•Support for virtual nodes (vnodes)
•Query Profiler
•Atomic batches
•Enhanced JBOD support
•Native binary CQL transport (no Thrift)
•Parallel leveled compactions
•Off-heap bloom filters
- 33. Collections
•Structure to column values
•Insert and update
• Map
• List
• Set
cqlsh> CREATE TABLE users (
user_id text PRIMARY KEY,
first_name text,
last_name text,
emails set<text>
);
http://www.datastax.com/dev/blog/cql3_collections
- 34. Request tracing
•Automatically stored for 24h
•Full path trace
•Includes node info
cqlsh> tracing on;
Now tracing requests.
cqlsh:foo> INSERT INTO test (a, b) VALUES (1, 'example');
Tracing session: 4ad36250-1eb4-11e2-0000-fe8ebeead9f9
activity | timestamp | source | source_elapsed
-------------------------------------+--------------+-----------+----------------
execute_cql3_query | 00:02:37,015 | 127.0.0.1 | 0
Parsing statement | 00:02:37,015 | 127.0.0.1 | 81
Preparing statement | 00:02:37,015 | 127.0.0.1 | 273
Determining replicas for mutation | 00:02:37,015 | 127.0.0.1 | 540
Sending message to /127.0.0.2 | 00:02:37,015 | 127.0.0.1 | 779
Messsage received from /127.0.0.1 | 00:02:37,016 | 127.0.0.2 | 63
Applying mutation | 00:02:37,016 | 127.0.0.2 | 220
Acquiring switchLock | 00:02:37,016 | 127.0.0.2 | 250
Appending to commitlog | 00:02:37,016 | 127.0.0.2 | 277
Adding to memtable | 00:02:37,016 | 127.0.0.2 | 378
Enqueuing response to /127.0.0.1 | 00:02:37,016 | 127.0.0.2 | 710
Sending message to /127.0.0.1 | 00:02:37,016 | 127.0.0.2 | 888
Messsage received from /127.0.0.2 | 00:02:37,017 | 127.0.0.1 | 2334
Processing response from /127.0.0.2 | 00:02:37,017 | 127.0.0.1 | 2550
Request complete | 00:02:37,017 | 127.0.0.1 | 2581
http://www.datastax.com/dev/blog/tracing-in-cassandra-1-2
- 35. Virtual Nodes (vnodes)
•Many nodes per JVM
•Tokens are auto-assigned (!!!)
•Faster...
✓repair
✓bootstrap
✓decommission
http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2
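The effect of giving each machine many tokens can be sketched in Python. This is a rough model under stated assumptions, not Cassandra's token allocation: `build_ring` and `ownership` are hypothetical helpers, tokens are drawn at random from a 128-bit space, and each token is assumed to own the range that ends at it.

```python
import random
from collections import Counter

RING_SIZE = 2 ** 128  # token space assumed for this sketch


def build_ring(nodes, tokens_per_node, seed=0):
    """Give every node tokens_per_node random tokens; return sorted (token, node) pairs."""
    rng = random.Random(seed)
    return sorted(
        (rng.randrange(RING_SIZE), node)
        for node in nodes
        for _ in range(tokens_per_node)
    )


def ownership(ring):
    """Fraction of the token space owned by each node."""
    shares = Counter()
    previous = ring[-1][0] - RING_SIZE  # wrap around from the last token
    for token, node in ring:
        shares[node] += (token - previous) / RING_SIZE
        previous = token
    return dict(shares)


nodes = ["node-a", "node-b", "node-c", "node-d"]
for tokens_per_node in (1, 256):
    shares = ownership(build_ring(nodes, tokens_per_node))
    print(tokens_per_node, {n: round(s, 3) for n, s in sorted(shares.items())})
```

With one random token per node the shares can be badly lopsided, while many small ranges per node tend to average out. The same scattering means a joining or leaving node exchanges data with many peers at once, which is a plausible reading of why repair, bootstrap, and decommission get faster.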