0
Cassandra     forPython Developers     Tyler Hobbs
History    Open-sourced by Facebook   2008    Apache Incubator           2009    Top-level Apache project   2010    Da...
Strengths    Scalable    – 2x Nodes == 2x Performance    Reliable (Available)    – Replication that works    – Multi-DC ...
Strengths    Fast    – 10-30k writes/sec, 1-10k reads/sec    Analytics    – Integrated Hadoop support
Weaknesses    No ACID transactions    – Dont need these as often as youd think    Limited support for ad-hoc queries    ...
Clustering    Every node plays the same role    – No masters, slaves, or special nodes
Clustering             0      50          10      40          20             30
Clustering                       Key: “www.google.com”             0      50          10      40          20             30
Clustering                       Key: “www.google.com”             0                       md5(“www.google.com”)      50  ...
Clustering                       Key: “www.google.com”             0                       md5(“www.google.com”)      50  ...
Clustering                       Key: “www.google.com”             0                       md5(“www.google.com”)      50  ...
Clustering                          Key: “www.google.com”             0                          md5(“www.google.com”)    ...
Clustering   Client can talk to any node
Data Model    Keyspace    – A collection of Column Families    – Controls replication settings    Column Family    – Kin...
ColumnFamilies    Static    – Object data    Dynamic    – Pre-calculated query results
Static Column Families                   Users   zznate    password: *    name: Nate   driftx    password: *   name: Brand...
Dynamic Column Families                     Followingzznate    driftx:   thobbs:driftxthobbs    zznate:jbellis   driftx:  ...
Dynamic Column Families    Timeline of tweets by a user    Timeline of tweets by all of the people a    user is followin...
Pycassa    Python client library for Cassandra    Open Source (MIT License)    – www.github.com/pycassa/pycassa    User...
Installing Pycassa    easy_install pycassa    – or pip
Basic Layout    pycassa.pool    – Connection pooling    pycassa.columnfamily    – Primary module for the data API    py...
The Data API    RPC-based API    Rows are like a sorted list of (name,value)    tuples    – Like a dict, but sorted by t...
Inserting Data>>> from pycassa.pool import ConnectionPool>>> from pycassa.columnfamily import ColumnFamily>>>>>> pool = Co...
Inserting Data>>> columns = {“aaa”: 1, “ccc”: 3}>>> cf.insert(“key”, columns)>>> cf.get(“key”){“aaa”: 1, “ccc”: 3}>>>>>> #...
Fetching Data>>> cf.get(“key”){“aaa”: 42, “bbb”: 2, “ccc”: 3, “ddd”: 4}>>>>>> # Get a set of columns by name>>> cf.get(“ke...
Fetching Data>>> # Get a slice of columns>>> cf.get(“key”, column_start=”bbb”,...               column_finish=”ccc”){“bbb”...
Fetching Data>>> # Get the first two columns in the row>>> cf.get(“key”, column_count=2){“aaa”: 42, “bbb”: 2}>>>>>> # Get ...
Fetching Multiple Rows>>> columns = {“col”: “val”}>>> cf.batch_insert({“k1”: columns,...                  “k2”: columns,.....
Fetching a Range of Rows>>> # Get a generator over all of the rows>>> for key, columns in cf.get_range():...     print key...
Fetching Rows by Secondary Index>>> from pycassa.index import *>>>>>> # Build up our index clause to match>>> exp = create...
Deleting Data>>>   # Delete a whole row>>>   cf.remove(“key1”)>>>>>>   # Or selectively delete columns>>>   cf.remove(“key...
Connection Management    pycassa.pool.ConnectionPool    – Takes a list of servers        • Can be any set of nodes in you...
Async Options    eventlet    – Just need to monkeypatch socket and threading    Twisted    – Use Telephus instead of Pyc...
Tyler Hobbs        @tylhobbstyler@datastax.com
Upcoming SlideShare
Loading in...5
×

Cassandra for Python Developers

3,110

Published on

A high level introduction to Apache Cassandra followed by an introduction to pycassa, the Python client library for Cassandra.

Presented at PyTexas 2011 by Tyler Hobbs.

Published in: Technology
0 Comments
4 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
3,110
On Slideshare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
88
Comments
0
Likes
4
Embeds 0
No embeds

No notes for slide

Transcript of "Cassandra for Python Developers"

  1. 1. Cassandra forPython Developers Tyler Hobbs
  2. 2. History Open-sourced by Facebook 2008 Apache Incubator 2009 Top-level Apache project 2010 DataStax founded 2010
  3. 3. Strengths Scalable – 2x Nodes == 2x Performance Reliable (Available) – Replication that works – Multi-DC support – No single point of failure
  4. 4. Strengths Fast – 10-30k writes/sec, 1-10k reads/sec Analytics – Integrated Hadoop support
  5. 5. Weaknesses No ACID transactions – Dont need these as often as youd think Limited support for ad-hoc queries – Youll give these up anyway when sharding an RDBMS Generally complements another system – Not intended to be one-size-fits-all
  6. 6. Clustering Every node plays the same role – No masters, slaves, or special nodes
  7. 7. Clustering 0 50 10 40 20 30
  8. 8. Clustering Key: “www.google.com” 0 50 10 40 20 30
  9. 9. Clustering Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30
  10. 10. Clustering Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30
  11. 11. Clustering Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30
  12. 12. Clustering Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30 Replication Factor = 3
  13. 13. Clustering Client can talk to any node
  14. 14. Data Model Keyspace – A collection of Column Families – Controls replication settings Column Family – Kinda resembles a table
  15. 15. ColumnFamilies Static – Object data Dynamic – Pre-calculated query results
  16. 16. Static Column Families Users zznate password: * name: Nate driftx password: * name: Brandon thobbs password: * name: Tyler jbellis password: * name: Jonathan site: riptano.com
  17. 17. Dynamic Column Families Followingzznate driftx: thobbs:driftxthobbs zznate:jbellis driftx: mdennis: pcmanus thobbs: xedin: zznate
  18. 18. Dynamic Column Families Timeline of tweets by a user Timeline of tweets by all of the people a user is following List of comments sorted by score List of friends grouped by state
  19. 19. Pycassa Python client library for Cassandra Open Source (MIT License) – www.github.com/pycassa/pycassa Users – Reddit – ~10k github downloads of every version
  20. 20. Installing Pycassa easy_install pycassa – or pip
  21. 21. Basic Layout pycassa.pool – Connection pooling pycassa.columnfamily – Primary module for the data API pycassa.system_manager – Schema management
  22. 22. The Data API RPC-based API Rows are like a sorted list of (name,value) tuples – Like a dict, but sorted by the names – OrderedDicts are used to preserve sorting
  23. 23. Inserting Data>>> from pycassa.pool import ConnectionPool>>> from pycassa.columnfamily import ColumnFamily>>>>>> pool = ConnectionPool(“MyKeyspace”)>>> cf = ColumnFamily(pool, “MyCF”)>>>>>> cf.insert(“key”, {“col_name”: “col_value”})>>> cf.get(“key”){“col_name”: “col_value”}
  24. 24. Inserting Data>>> columns = {“aaa”: 1, “ccc”: 3}>>> cf.insert(“key”, columns)>>> cf.get(“key”){“aaa”: 1, “ccc”: 3}>>>>>> # Updates are the same as inserts>>> cf.insert(“key”, {“aaa”: 42})>>> cf.get(“key”){“aaa”: 42, “ccc”: 3}>>>>>> # We can insert anywhere in the row>>> cf.insert(“key”, {“bbb”: 2, “ddd”: 4})>>> cf.get(“key”){“aaa”: 42, “bbb”: 2, “ccc”: 3, “ddd”: 4}
  25. 25. Fetching Data>>> cf.get(“key”){“aaa”: 42, “bbb”: 2, “ccc”: 3, “ddd”: 4}>>>>>> # Get a set of columns by name>>> cf.get(“key”, columns=[“bbb”, “ddd”]){“bbb”: 2, “ddd”: 4}
  26. 26. Fetching Data>>> # Get a slice of columns>>> cf.get(“key”, column_start=”bbb”,... column_finish=”ccc”){“bbb”: 2, “ccc”: 3}>>>>>> # Slice from “ccc” to the end>>> cf.get(“key”, column_start=”ccc”){“ccc”: 3, “ddd”: 4}>>>>>> # Slice from “bbb” to the beginning>>> cf.get(“key”, column_start=”bbb”,... column_reversed=True){“bbb”: 2, “aaa”: 42}
  27. 27. Fetching Data>>> # Get the first two columns in the row>>> cf.get(“key”, column_count=2){“aaa”: 42, “bbb”: 2}>>>>>> # Get the last two columns in the row>>> cf.get(“key”, column_reversed=True,... column_count=2){“ddd”: 4, “ccc”: 3}
  28. 28. Fetching Multiple Rows>>> columns = {“col”: “val”}>>> cf.batch_insert({“k1”: columns,... “k2”: columns,... “k3”: columns})>>>>>> # Get multiple rows by name>>> cf.multiget([“k1”,“k2”]){“k1”: {”col”: “val”}, “k2”: {“col”: “val”}}>>> # You can get slices of each row, too>>> cf.multiget([“k1”,“k2”], column_start=”bbb”) …
  29. 29. Fetching a Range of Rows>>> # Get a generator over all of the rows>>> for key, columns in cf.get_range():... print key, columns“k1” {”col”: “val”}“k2” {“col”: “val”}“k3” {“col”: “val”}>>> # You can get slices of each row>>> cf.get_range(column_start=”bbb”) …
  30. 30. Fetching Rows by Secondary Index>>> from pycassa.index import *>>>>>> # Build up our index clause to match>>> exp = create_index_expression(“name”, “Joe”)>>> clause = create_index_clause([exp])>>> matches = users.get_indexed_slices(clause)>>>>>> # results is a generator over matching rows>>> for key, columns in matches:... print key, columns“13” {”name”: “Joe”, “nick”: “thatguy2”}“257” {“name”: “Joe”, “nick”: “flowers”}“98” {“name”: “Joe”, “nick”: “fr0d0”}
  31. 31. Deleting Data>>> # Delete a whole row>>> cf.remove(“key1”)>>>>>> # Or selectively delete columns>>> cf.remove(“key2”, columns=[“name”, “date”])
  32. 32. Connection Management pycassa.pool.ConnectionPool – Takes a list of servers • Can be any set of nodes in your cluster – pool_size, max_retries, timeout – Automatically retries operations against other nodes • Writes are idempotent! – Individual node failures are transparent – Thread safe
  33. 33. Async Options eventlet – Just need to monkeypatch socket and threading Twisted – Use Telephus instead of Pycassa – www.github.com/driftx/telephus – Less friendly, documented, etc
  34. 34. Tyler Hobbs @tylhobbstyler@datastax.com
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×